> If the timestamps are accurate, what was causing the errors 10 minutes before the account was suspended?
Assuming the timestamps are accurate, Google probably started terminating resources while the account was not "suspended" and only completed that after all resources were disabled.
Or the account started doing something nefarious (assuming one of their customers as root cause, not railway itself) that started causing real problems and Google shut it down.
The problem with not having the data is that it’s easy to make assumptions.
The absence of any explanation for the suspension does seem intentional. If it were me that's one of the first things I would've asked so that I could make sure it doesn't happen again.
The 22:20 timestamp from the body of the post is wrong. The timeline section (where the 22:10 timestamp came from) is consistent with itself, and also contains:
> May 19, 22:19 UTC - Root cause identified: Google Cloud Platform has suspended Railway's production account.
They couldn't have identified the root cause before it happened.
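The ordering problem can be checked mechanically. A minimal sketch, using only the three timestamps quoted in this thread; the labels are shorthand rather than Railway's wording, and the first check assumes the health check failures were caused by the suspension, which is exactly what is in question:

```python
from datetime import datetime

# Timestamps (UTC, May 19) as quoted in the thread; labels are shorthand.
FMT = "%H:%M"
detected = datetime.strptime("22:10", FMT)        # timeline: health check failures detected
root_cause = datetime.strptime("22:19", FMT)      # timeline: root cause identified
suspended_body = datetime.strptime("22:20", FMT)  # post body: account suspended

def is_consistent(cause, effect):
    """A cause cannot come after the effect it is supposed to explain."""
    return cause <= effect

# Assumption: the suspension is the cause of both later observations.
checks = {
    "suspension before detection": is_consistent(suspended_body, detected),
    "suspension before root cause identified": is_consistent(suspended_body, root_cause),
}
for name, ok in checks.items():
    print(f"{name}: {'ok' if ok else 'INCONSISTENT'}")
```

Both checks fail with the 22:20 figure, which is why the body timestamp looks like the wrong one.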
That 10 minutes is likely very normal. Possibly...
* A Google employee messes up a setting (as in one of the previous incidents), which triggers something that makes a suspension look warranted, and it takes 10 minutes to flow through the process to suspend.
* A Railway customer does something corrupt, or seemingly corrupt; Google's system starts limiting access and takes 10 minutes to decide it should be a suspension.
These are even more likely if there is a person in the loop to approve, who obviously did not dig deep enough to see that they should not have done so.
Even if it ultimately turns out to be "Google's fault" (as this report seems to be saying), Railway say they own the incident but make no apology here.
I've read all the threads and their main page and I still don't really understand what this service is. Is this like a commercial alternative to Gerrit? What do people use this for?
Here is my source code
Run it on the cloud for me
I do not care how
In this case it looks like they also bundle together a bunch of the other services you would need to get code onto the platform, monitor it once it’s there and so on
Oh I see, so they manage the server hosting and application server configuration, optimization and all that jazz. Almost like one step away from managed hosting. Makes sense now, thank you!
Honestly, I have been wanting to suggest to my leaders that we should go on-prem for primary and use cloud only as extra capacity for peak traffic and/or failover, etc. But the culture where I'm at is so bought into cloud, as if it solves all problems. And then, in the next breath, they all ask me to drastically reduce cloud costs and ensure 100% uptime at all times, 24/7/365 (100% uptime without complexity and without any added costs!).
I think this is just the default endgame of large corporates which suck up large quantities of customers. They are a race to the bottom and you end up with service by footgun. My own company is responsible for doing this in our sector. Literally every technology decision favours automation over verification because it's cheaper to say sorry than do it right.
Amazon played AWS from day 1 as if they were the runner-up (and in a sense they were), and while it does look like it's day 2 there, they are not letting the momentum down
Microsoft might have technical warts but commercially they are strong and Azure is a lot of times bundled with other services and you know you can get someone on the phone if needed
Honestly they really are starting to look that way. Total opinionated Walled Garden that's against an open and thriving ecosystem. Unlike Microsoft the technology is not yet garbage but I hope this isn't where they're going to end up
It sure is heading towards being garbage, though. Search is actively being degraded in favor of a barely functioning AI, and I'm sure it's not going to stop there. Seems like it was inevitable once ad/finance people got ahold of the company.
> Your customers don't care whether the failure was Google or Railway; they see your product.
Refreshing. So tired of businesses blaming their vendors. Oh it wasn't us spamming you text messages and emails, it was Shopify. Oh, our delivery guarantee said 2 days and it's been a week? That's not us, it's UPS.
I don't care. I didn't pay UPS or Shopify. I paid you.
It really is amazing that there is not some level at which "human review" becomes mandatory. Customers of that size already have dedicated account rep contacts.
I can't believe Kurian has not put his foot down about this. Adverse action against accounts over $X ARR absolutely must have review by revenue-carrying people before the action is taken.
> Railway’s production account into a suspended status incorrectly, as part of an automated action.
Be it individuals or companies, now is the best time to ditch all dependence on anything cloud or SaaS. Since all of them are using automated AI, more and more of these incidents will occur.
Google has a culture problem. This is not something that can change easily nor will it change when it’s not recognized as being an issue within their organization.
Between my peer c-suites, the conversation is that GCP cannot even be in the consideration set until such a time as a several-year period has elapsed without this kind of incident.
So, what was the reason for the account suspension? Why did it happen? I know Google can be a bit stupid with their automatons, but I am a bit skeptical here. There are sites more critical than Railway hosted on GCP.
I've been getting serious, recently, about moving all my workloads to equipment that I control in datacenters with which I have professional relationships. It's less expensive, easier, and this kind of nonsense doesn't happen. These cloud providers need to step back and observe how terrible they've made these products. Footguns everywhere, pricing that is impossible to forecast or reason about, broken APIs, and automated self destruction. Then you have third-party providers sitting on top of them, adding another layer of each antifeature. Crazy.
> ...These cloud providers need to step back and observe how terrible they've made these products...
I doubt that will happen because none of them want to stop the money-making machine they have! And if your thought after my comment is that all us techies are making a fuss, so the cloud providers and businesses using them will hear our cries and that will trigger a backlash...? I doubt that too, because some senior business leaders that I see are bent on listening more to management consultants as opposed to a balance of folks including their own internal experts. But, alas, maybe I'm just having too cynical a day today. :-)
It's really surprising how much cheaper colo becomes if you have an even vaguely predictable workload. And you don't have to be a major customer, either -- the data centers will happily sell you single U's or a couple U's, even on a monthly basis if you ask, making it perfectly viable for startups or advanced personal projects.
> These cloud providers need to step back and observe how terrible they've made these products.
They don't, because the allure of effortless scaling is hard to resist: everyone thinks of themselves as the next tech unicorn. And if you actually become a unicorn, you're already too dependent on AWS / Azure / GCP to easily move somewhere else. At best, your strategy is to become "multi-cloud".
That effortlessness is a fantasy. That's illustrated right here in this write-up by how complicated their system is.
>Railway’s network is a mesh ring, built up of high availability fiber interconnects between Metal <> GCP <> AWS. However, in this ring, there was still a hard dependency on workload discoverability being tied to the network control plane API that was hosted on the machines running in Google Cloud
The thing that's nice about physical datacenters with people is that they often have to physically walk over to disconnect you - it's not as easy as some automated system doing an AI.
"Finally, we are in planning to remove Google Cloud services from our data plane’s hot path, and keeping them only for secondary/failover."
That's pretty clear. Google can no longer be trusted as a B2B service provider.
Meta is no different. I know a company that had their OAuth app on Meta rendered completely unusable just because one of their employees (a dev) had their personal Facebook account banned by Meta for no reason. They tried to escalate it multiple times but got nowhere, lol. Meta is even worse because accounts need to be 'personal'; if you have a Business Manager, the users added to it are all tied to their personal Meta/Facebook accounts. This is ludicrous.
Yeah, people lose their business because a kid is logged in on their iPad, gets their Google account suspended, and Google knows it's the same household as the parent, and everything gets shut down.
Everyone needs a defensible root of trust, this goes all the way down to the registrar you use for your domain.
Meta and Google B2B are both horrible. Their ad account bans are constant, and they have no real escalation process to get help. These companies are monopolies that should treat businesses more seriously, especially in these situations.
Who said anything about meta? Is meta selling compute to companies? Why even bring them up?
Seems relevant to me as it is still a service that their company relied on.
They trust them enough to still give them money, just goes to show how entrenched big tech is and why they need to be broken up into dozens of pieces.
More businesses need to hear this message. Google have proven time and time again they cannot be trusted as a service provider, exactly because of this problem.
They have not explained WHY their account was suspended. That's the most important part, imo. Cloud Providers don't suspend entire accounts for no reason.
> Cloud Providers don't suspend entire accounts for no reason.
You're joking, right?
The cloud provider in question - GCP - who also deleted a 125 billion dollar company's entire account on accident?
Unfortunately the cloud providers also rarely if ever tell you the reason.
Railway has an overwhelming incentive to pin the blame on Google. This report doesn't answer why Google suspended Railway's account.
I'd wait for more details before adjudicating.
In principle, I agree with you.
In practice, Google has earned the way my priors are ready to believe it's 100% their fault with mighty and sustained effort. Or lack thereof, depending on your point of view.
They said it was automated and affected a bunch of other customers, which gives at least some hint.
And in general Google lost any immediate benefit of the doubt status many years ago. Many such stories.
Railway don't have a great reputation for building scalable systems (effects of vibe coding?). It's worth waiting for Google's response before jumping to conclusions. They can move to Azure/AWS/own datacenter, but there's a good chance this will repeat in a few months.
Never could. Google might block your entire company because one of your workers did something nasty on their personal account, and their ban hammer is mighty and blocks all related accounts to the Nth degree
This should be a warning to anyone running GCP. They suspend accounts left right and centre without even thinking about what they're doing. It seems like they use Gemini 3.1 Pro to run their production decisions.
TK has a history of absolutely destroying the culture of the place like in OCI and has done something similar in GCP from what I've heard. GCP and Google are completely different entities with how they work. Don't expect Google quality from the name. It's just like those old brands which now have cheap licensed products like Nokia (An exaggeration I know but not far from truth).
Not only that they are known to shut off their services randomly giving you like 6 months to migrate. They have lots of engineers not doing anything, so they put them on migrating internal users off those services, most of their clients don't. There was a brilliant article on this by an ex-GCP employee that I can't find right now.
Avoid GCP like plague if you are serious about your business.
And this is Railway, a big enough name to top the HN main page and presumably find someone from Google to intervene at some point. I would have zero recourse if it was some little product that I built.
All Google products work like this. They should never be used for anything critical.
The interesting and yet-to-be-explained part is why Google flagged the account.
Put all the timestamps you want in the post mortem about what you observed, but you haven't addressed the root cause.
The "this doesn't make sense" part of the story likely has a real explanation that nobody wants to reveal yet.
Shouldn't Google answer this if they are unhappy with this incident report? Are we even sure that Railway knows?
I seriously doubt Railway knows. That's the MO for Google and others, suspend account without explanation.
The report at this point is pretty much just a timeline of what happened. No explanation of why, no accusations, no blame. A PR piece, to Railway's customers, reassuring them that "we're not ignoring this."
Now the lawyers are huddling. IMO there won't be a lot more said publicly by either side, at least until any threat of lawsuits for damages is settled.
They can't - that would violate the privacy rights of their customer.
They need to tell Railway and Railway needs to tell us, or Railway can tell us that Google is refusing to tell them.
Either way, we need to hear about this from Railway.
I don't think you're typically told why for these things, and it's mostly automated from what I can tell. The automated systems make mistakes but more importantly they're completely opaque. Nobody, not even Google, knows how they work exactly.
Google should know why a human accepted the automated suggestion, or if and why there wasn't any human oversight in the first place.
That’s the point where Google tells you they won’t tell you the exact reason because of security reasons.
Exactly this, which is the problem with all modern accounts. No person to talk to so you can understand what happened and maybe fix it.
They most definitely have a person to talk to. They're not the largest Google Cloud user by far, but they are large enough to have human account reps.
And those reps might not be told what the reason is.
They also don't want to tell you because then they have to put rules and cannot ban people arbitrarily.
Giving reasons is putting accountability on Google and they don't want that.
This isn’t the first time Google Cloud has seriously messed with a customer’s account: https://cloud.google.com/blog/products/infrastructure/detail...
Railway has not had the best month in the tech press have they? And in both cases it was an automated process belonging to some other party that put them there, damaging their reputation.
I was going to talk to our Google rep about their killing the Gemini CLI, but this is way more concerning.
Building on someone else's platform is always gonna be a risky move, and building a platform on top of someone else's platform is even riskier.
My company used to use a hosting provider that was basically AWS plus some extra guarantees. We just finished migrating onto regular AWS because they now offer what we need directly.
But...AWS is a platform too, no? Seems like you're in the same category of risk; you just moved to a more well-known name. Granted, Amazon is the most reliable even if they have their own quirks.
Each critical dependency you stack multiplies your risk. Now you have to worry about Railway AND Google causing business-damaging outages.
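As a rough illustration of how stacked dependencies compound: if every layer must be up and the layers fail independently, the availability of the whole chain is the product of the layers' availabilities (so strictly it is the availabilities that multiply, and the expected downtime roughly adds). The figures below are made-up placeholders, not measured numbers for Railway or Google.

```python
from math import prod

def stacked_availability(layers):
    """Availability of a chain where every layer must be up (assumes independent failures)."""
    return prod(layers)

def downtime_hours_per_year(availability, hours_per_year=24 * 365):
    """Expected hours of downtime per year for a given availability fraction."""
    return (1 - availability) * hours_per_year

# Hypothetical availabilities, chosen only for illustration.
cloud_only = [0.999]
platform_on_cloud = [0.999, 0.999]  # e.g. a PaaS layered on top of an IaaS

for name, layers in [("cloud only", cloud_only), ("platform on cloud", platform_on_cloud)]:
    a = stacked_availability(layers)
    print(f"{name}: {a:.6f} available, ~{downtime_hours_per_year(a):.1f} h/year down")
```

With these placeholder inputs the second layer roughly doubles the expected downtime, and that is before counting failures that are correlated rather than independent.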
Unfortunately we had to make an emergency migration to Azure yesterday due to this. Thankfully our DB was not hosted on Railway and we were back up in a couple of hours.
As much as we loved the simplicity they provided us, there have just been too many mishaps and shortcomings for us to continue running a B2B enterprise app on their infrastructure.
Sad day :(
Azure suspended your account as well?
Question for a smaller SaaS tool, or even an internal product: if a team doesn't want to manage AWS or another IaaS provider, what are the best alternatives to the following?
1.) Vercel - having a bad month
2.) Supabase - having a bad month
3.) Railway - now having a bad month
An intermediary can provide value but there’s also a risk so I’d consider why you don’t want to use AWS, GCP, etc. directly. All of the major cloud providers have services which are only slightly harder than what Railway does but allow you to grow into more advanced things as your needs expand without adding a third-party who controls your features, security, and availability.
As an example, I note that GCP responded within 7 minutes according to their timeline. If you’d been using Cloud Run, that would have reduced downtime by over 7 hours — and there’s a good chance that you never would have gone down in the first place if the unknown trigger event was related to other customer activity or something odd Railway did.
There’s also a complexity factor: note how much complex infrastructure they mentioned having to fix that you wouldn’t need for your own account. That code does useful things, I’m sure, but it’s also a lot of moving parts which a hosting provider needs and you don’t – this outage took everyone down, whereas individual AWS or bare metal users would’ve otherwise been unaffected. There isn’t a global optimum which is the same for everyone but I think developers are prone to wildly over-estimating how much time they save by removing a couple of deployment steps relative to the direct costs and the less obvious costs of working within someone else’s environment.
If you are unable to use IaaS directly, you need to accept that your service might be down.
Even if you use AWS and the like, if you aren't building your app with redundancy across multiple AZs, then you'll have some downtime occasionally.
And even if you do build redundancy with multiple AZs, some services might fail anyway, as AWS is not entirely isolated. So you might have downtime.
So just accept downtimes and use the best tool for you (unless they are really bad, like GitHub level bad). If you cannot accept any downtime, you'll have to spend millions of dollars and months of work to have the confidence to expect no downtime. Something like Netflix's chaos monkey and infrastructure would be enough.
DigitalOcean. Seriously. They have been around a long long time and built a lot of the core infrastructure you rely on every day (e.g. Ceph).
I have read plenty of snark about them on HN, but I found their product incredibly useful, well-designed, and easy to work with. If I was building a new startup from scratch, I'd definitely be giving them a look.
I'm sure there are plenty of the like 1,000 AWS products that DO has no viable competitor for, but for what they do offer, they're great.
I've had my share of VPS and managed DB outages at DO, so they are also not faultless.
I've been with DO since (checks mailbox) 2014. Honestly never experienced an unannounced outage.
Yeah, overall they are OK. I think a managed DB went down three times and once or twice a VPS was just dead. No issues in a year or so.
Not if, but when. No one is faultless. Chasing after 100% is a fool's errand.
I think the message here is that you can't trust any single cloud provider. You at least need two with full operational capability.
Yup. I don't know enough people at giant companies to know how many actually do this though. Not just talking having 2 AZs, I'm talking about ability in a DR scenario to fail over, within 5-10 minutes, to a different cloud provider, e.g. AWS → Hetzner, or GCP → Azure.
My gut feeling is that the number of significant applications that have this capability can probably be counted on two hands. Especially since a lot of the largest footprints of software stacks running in the cloud belong to Google and Microsoft, who I'm pretty sure do not replicate their services into someone else's cloud.
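For what it's worth, the decision logic for that kind of cross-provider failover is the easy part; keeping the second provider warm with current data is the hard part. A minimal sketch, assuming hypothetical provider names and health-check callables (a real check would probe an endpoint, and the switch itself would repoint DNS or a load balancer):

```python
from typing import Callable, Optional

# Provider names and health checks are hypothetical stand-ins.
HealthCheck = Callable[[], bool]

def pick_active(providers: list[tuple[str, HealthCheck]]) -> Optional[str]:
    """Return the first provider, in priority order, whose health check passes."""
    for name, is_healthy in providers:
        try:
            if is_healthy():
                return name
        except Exception:
            # A check that raises is treated the same as an unhealthy provider.
            continue
    return None

def primary_check() -> bool:
    # Simulate the primary being unreachable, e.g. after an account suspension.
    raise ConnectionError("account suspended")

providers = [("primary-cloud", primary_check), ("secondary-cloud", lambda: True)]
print(pick_active(providers))
```

Nothing here addresses data replication or the 5-10 minute target; it only shows that the selection step itself is trivial compared with everything around it.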
Maybe a VPS? Simple to manage and way cheaper.
But really any service (or even on-site hosting) can have downtime, if that's not acceptable then I suppose building/using a tool that can be distributed between multiple hosts located in different geographical areas is the best option.
Haven't used Railway but my understanding is they are something similar to Heroku. Fly.io has been pretty great for tiny projects in that niche.
For Vercel, if your Next.js site can be compiled statically you could probably throw it up on almost anything. We've self-hosted before, which is pretty straightforward, but you lose a lot of the image optimization stuff unless you go deep into setting up OpenNext.
Depending on exactly what you're building, all of these things sound like one VPS. There's a bit of maintenance/security burden managing the machine if you're not used to it, but as the others have said: Next.js can be self-hosted, unless you need the serverless/edge stuff; then I would go to Cloudflare Workers.
Fly, Render, and even Heroku are all still better choices than working with Railway, I think.
Hetzner (or any VM provider) + Dokku works best.
Shameless self plug but check out: https://specific.dev (especially if you use coding agents)
No code lock-in through SDKs and built on top of AWS with great DX for both developer and coding agents
"Your customers don't care whether the failure was Google or Railway; they see your product. Your uptime is our responsibility, and we'll keep delivering on it." - Thanks Claude!
What drives Google to apply these actions so completely and immediately, versus a more deliberate approach, with notification and delay before action, manual review for paying customers, or a warning to resolve within X hours/days? Once or twice could be errors or bad implementation, but these can't explain away the pattern.
It would seem that Google's counsel has deemed that whenever _____ is detected, the company must immediately and completely sever the business relationship. What is that driving concern? Is it sanctions enforcement? CSAM? Something else?
The problem is scale. Google uses automation and doesn't have the people to review the actions of that automation. I never worked at Google but this is the most obvious explanation from watching these things happen for years and years.
Please, someone that worked at Google, please comment.
> May 19, 22:10 UTC - Our automated monitoring detected API health check failures and paged our on-calls, who started investigating the issue.
> At 22:20 UTC on May 19, Google Cloud placed Railway’s production account into a suspended status incorrectly, as part of an automated action.
If the timestamps are accurate, what was causing the errors 10 minutes before the account was suspended?
The simplest explanation is just that one or the other of these timestamps is wrong, which wouldn't be a big deal. But if the timestamps aren't known with certainty, it seems very odd to include them in the writeup as though they are certain, even though they are very obviously inconsistent with each other.
> If the timestamps are accurate, what was causing the errors 10 minutes before the account was suspended?
Assuming the timestamps are accurate, Google probably started terminating resources while the account was not "suspended" and only completed that after all resources were disabled.
Or the account started doing something nefarious (assuming one of their customers as root cause, not railway itself) that started causing real problems and Google shut it down.
The problem with not having the data is that it’s easy to make assumptions.
The absence of any explanation for the suspension does seem intentional. If it were me that's one of the first things I would've asked so that I could make sure it doesn't happen again.
The 22:20 timestamp from the body of the post is wrong. The timeline section (where the 22:10 timestamp came from) is consistent with itself, and also contains:
> May 19, 22:19 UTC - Root cause identified: Google Cloud Platform has suspended Railway's production account.
They couldn't have identified the root cause before it happened.
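To make the ordering problem explicit, here's a small sketch using only the three timestamps quoted in this thread. The quotes give no year, so a placeholder is used, and treating both timeline entries as downstream of the suspension is an assumption for the sake of the check, not something the writeup states.

```python
from datetime import datetime, timezone

# Timestamps as quoted in this thread (May 19, UTC). The quotes give no year,
# so a placeholder is used; only the relative ordering matters.
PLACEHOLDER_YEAR = 2000


def may19(hour: int, minute: int) -> datetime:
    return datetime(PLACEHOLDER_YEAR, 5, 19, hour, minute, tzinfo=timezone.utc)


SUSPENDED = "account suspended (body of post)"
DETECTED = "health check failures detected (timeline)"
IDENTIFIED = "root cause identified as suspension (timeline)"

events = {
    DETECTED: may19(22, 10),
    IDENTIFIED: may19(22, 19),
    SUSPENDED: may19(22, 20),
}


def effects_before_cause(events, cause, effects):
    """Return (effect, minutes_early) for each effect stamped before its cause."""
    out = []
    for name in effects:
        delta = events[cause] - events[name]
        if delta.total_seconds() > 0:
            out.append((name, int(delta.total_seconds() // 60)))
    return out


conflicts = effects_before_cause(events, SUSPENDED, [DETECTED, IDENTIFIED])
for name, minutes in conflicts:
    print(f"{name}: {minutes} min before the stated suspension time")
```

Both timeline entries land before 22:20, which is why the 22:20 figure is the odd one out.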
That 10 minutes is likely very normal. Possibly...
* A Google employee messes up a setting (like in one of the previous incidents), which triggers something that looks like a suspension is warranted, and it takes 10 minutes to flow through the process to suspend.
* A Railway customer does something corrupt, or seemingly corrupt; Google's system starts limiting access and takes 10 minutes to decide it should be a suspension.
These are even more likely if there is a person in the loop to approve, who obviously did not dig deep enough to see that they should not have done so.
Even if it ultimately turns out to be "Google's fault" (as this report seems to be saying), Railway say they own the incident but make no apology here.
They forgot to get reimbursement for downtime. A free month of GCP is better than nothing.
I've read all the threads and their main page and I still don't really understand what this service is. Is this like a commercial alternative to Gerrit? What do people use this for?
I'm not a developer, just curious what this is.
The category is “Platform as a Service”
Alternative to Fly or Heroku
Here is my source code / Run it on the cloud for me / I do not care how
In this case it looks like they also bundle together a bunch of the other services you would need to get code onto the platform, monitor it once it’s there and so on
Oh I see, so they manage the server hosting and application server configuration, optimization and all that jazz. Almost like one step away from managed hosting. Makes sense now, thank you!
I will definitely not be signing up on GCP because of this.
Had a similar experience with GCP. They terminated VMs six times and responded zero times.
Duplicate of:
https://news.ycombinator.com/item?id=48201484
Looks like it was endorsed by dang: https://news.ycombinator.com/item?id=48210941
back to on-prem
Honestly, i have been wanting to suggest to my leaders that we should go to on-prem for primary, and use cloud only as extra for peak traffic and/or failover, etc...but, the culture where i'm at is so bought into cloud as if it solves all problems...and then, in the next breath they all ask me to drastically reduce cloud costs and ensure 100% uptime at all times 24/7/365 (100% uptime without complexity and without any added costs!).
Google, the new Microsoft!
I think this is just the default endgame of large corporates which suck up large quantities of customers. They are a race to the bottom and you end up with service by footgun. My own company is responsible for doing this in our sector. Literally every technology decision favours automation over verification because it's cheaper to say sorry than do it right.
Amazon played AWS from day 1 as if they were the runner-up (and in a sense they were), and while it does look like it's day 2 there, they are not letting the momentum down
Microsoft might have technical warts but commercially they are strong and Azure is a lot of times bundled with other services and you know you can get someone on the phone if needed
Google has... ?
> Google has... ?
former Oracle salespeople
Honestly they really are starting to look that way. Total opinionated Walled Garden that's against an open and thriving ecosystem. Unlike Microsoft the technology is not yet garbage but I hope this isn't where they're going to end up
It sure is heading towards being garbage, though. Search is actively being degraded in favor of a barely functioning AI, and I'm sure it's not going to stop there. Seems like it was inevitable once ad/finance people got ahold of the company.
> Your customers don't care whether the failure was Google or Railway; they see your product.
Refreshing. So tired of businesses blaming their vendors. Oh it wasn't us spamming you text messages and emails, it was Shopify. Oh, our delivery guarantee said 2 days and it's been a week? That's not us, it's UPS.
I don't care. I didn't pay UPS or Shopify. I paid you.
It's reassuring to know they will ban a million dollar enterprise customer just like they will ban your GMail of 20 years.
It really is amazing that there is not some level at which "human review" becomes mandatory. Customers of that size already have dedicated account rep contacts.
I can't believe Kurian has not put his foot down about this. Adverse action against accounts over $X ARR absolutely must have review by revenue-carrying people before the action is taken.
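As a rough sketch of what such a gate could look like: this is purely hypothetical and says nothing about how Google's systems actually work. The names, the `Decision` values, and the threshold parameter are all invented for illustration. The idea is just that an automated adverse action is held, not applied, when the account is above the revenue threshold and nobody has signed off.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    APPLY = "apply"
    HOLD_FOR_REVIEW = "hold_for_review"


@dataclass(frozen=True)
class Account:
    id: str
    annual_revenue_usd: float  # the "$X ARR" from the comment above


def gate_adverse_action(
    account: Account,
    review_threshold_usd: float,
    approved_by: Optional[str] = None,
) -> Decision:
    """Hold automated suspensions of large accounts until a human approves."""
    needs_review = account.annual_revenue_usd >= review_threshold_usd
    if needs_review and approved_by is None:
        return Decision.HOLD_FOR_REVIEW
    return Decision.APPLY


big = Account(id="example-enterprise", annual_revenue_usd=1_000_000)
print(gate_adverse_action(big, review_threshold_usd=100_000))
print(gate_adverse_action(big, review_threshold_usd=100_000, approved_by="account-rep"))
```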
> Railway’s production account into a suspended status incorrectly, as part of an automated action.
Be it individuals or companies, now is the best time to ditch all dependence on anything cloud or SaaS. Since all of them are using automated AI, more and more of these incidents will occur.
Google has a culture problem. This is not something that can change easily nor will it change when it’s not recognized as being an issue within their organization.
Between my peer c-suites, the conversation is that GCP cannot even be in the consideration set until such a time as a several-year period has elapsed without this kind of incident.
Flagged by some AI automation.
So, what was the reason for the account suspension? Why did it happen? I know Google can be a bit stupid with their automatons but I am a bit skeptical here. There are sites more critical than Railway hosted on GCP.
I've been getting serious, recently, about moving all my workloads to equipment that I control in datacenters with which I have professional relationships. It's less expensive, easier, and this kind of nonsense doesn't happen. These cloud providers need to step back and observe how terrible they've made these products. Footguns everywhere, pricing that is impossible to forecast or reason about, broken APIs, and automated self destruction. Then you have third-party providers sitting on top of them, adding another layer of each antifeature. Crazy.
> ...These cloud providers need to step back and observe how terrible they've made these products...
I doubt that will happen because none of them want to stop the money-making machine they have! And, if your thought after my comment is that all us techies are making a fuss, so the cloud providers and businesses using them will hear our cries and trigger a backlash...? I doubt that too...because some senior business leaders that i see are bent on listening more to management consultants as opposed to a balance of folks including their own internal experts...but, alas, maybe i'm just having too cynical a day today. :-)
It's really surprising how much cheaper colo becomes if you have an even vaguely predictable workload. And you don't have to be a major customer, either -- the data centers will happily sell you single U's or a couple U's, even on a monthly basis if you ask, making it perfectly viable for startups or advanced personal projects.
> These cloud providers need to step back and observe how terrible they've made these products.
They don't, because the allure of effortless scaling is hard to resist: everyone thinks of themselves as the next tech unicorn. And if you actually become a unicorn, you're already too dependent on AWS / Azure / GCP to easily move somewhere else. At best, your strategy is to become "multi-cloud".
That effortlessness is a fantasy. That's illustrated right here in this write-up by how complicated their system is.
> Railway’s network is a mesh ring, built up of high availability fiber interconnects between Metal <> GCP <> AWS. However, in this ring, there was still a hard dependency on workload discoverability being tied to the network control plane API that was hosted on the machines running in Google Cloud
What the hell is even that?
The thing that's nice about physical datacenters with people is that they often have to physically walk over to disconnect you - it's not as easy as some automated system doing an AI.
And if they do, you can walk over there too and ask a human why in person. (Or just call the NOC)
Related discussion during the incident:
https://news.ycombinator.com/item?id=48201484
Perfect reminder that it's time to use Google Takeout while I still can.