I don't find the wording in the RFC to be that ambiguous actually.
> The answer to the query, possibly preface by one or more CNAME RRs that specify aliases encountered on the way to an answer.
The "possibly preface" (sic!) to me is obviously to be understood as "if there are any CNAME RRs, the answer to the query is to be prefaced by those CNAME RRs" and not "you can preface the query with the CNAME RRs or you can place them wherever you want".
The article makes it very clear that the ambiguity arises in another phrase: “difference in ordering of the RRs in the answer section is not significant”, which is applied to an example; the problem with examples being that they are illustrative, viz. generalisable, and thus may permit order ambiguity everywhere, and in any case, whether they should or shouldn’t becomes a matter of pragmatic context.
Which goes to show, one person’s “obvious understanding” is another’s “did they even read the entire document”.
All of which also serves to highlight the value of normative language, but that came later.
I agree with you, and I also think that their interpretation of example 6.2.1 in the RFC is somewhat nonsensical. It states that “The difference in ordering of the RRs in the answer section is not significant.” But from the RFC, very clearly this comment is relevant only to that particular example; it is comparing two responses and saying that in this case, the different ordering has no semantic effect.
And perhaps this is somewhat pedantic, but they also write that “RFC 1034 section 3.6 defines Resource Record Sets (RRsets) as collections of records with the same name, type, and class.” But looking at the RFC, it never defines such a term; it does say that within a “set” of RRs “associated with a particular name” the order doesn’t matter. But even if the RFC had said “associated with a particular combination of name, type, and class”, I don’t see how that could have introduced ambiguity. It specifies an exception to a general rule, so obviously if the exception doesn’t apply, then the general rule must be followed.
Anyway, Cloudflare probably know their DNS better than I do, but I did not find the article especially persuasive; I think the ambiguity is actually just a misreading, and that the RFC does require a particular ordering of CNAME records.
Isn't this literally noted in the article? The article even points out that the RFC is from before normative words were standardized for hard requirements.
"With a sufficient number of users of an API,
it does not matter what you promise in the contract:
all observable behaviors of your system
will be depended on by somebody."
combined with failure to follow Postel's Law:
"Be conservative in what you send, be liberal in what you accept."
What's reasonable is: "Set reserved fields to 0 when writing and ignore them when reading." (I heard that was the original example). Or "Ignore unknown JSON keys" as a modern equivalent.
What's harmful is: Accept an ill defined superset of the valid syntax and interpret it in undocumented ways.
Very much so. A better law would be conservative in both sending and accepting, as it turns out that if you are liberal in what you accept, senders will choose to disobey Postel's law and be liberal in what they send, too.
"Warnings" are like the most difficult thing to 'send' though. If an app or service doesn't outright fail, warnings can be ignored. Even if not ignored... how do you properly inform? A compiler can spit out warnings to your terminal, sure. Test-runners can log warnings. An RPC service? There's no standard I'm aware of. And DNS! Probably even worse. "Yeah, your RRs are out of order but I sorted them for you." where would you put that?
The Python 3 community was famously divided on that matter, wrt Python 3. Now that it is over, most people on the "accept liberally" side of the fence have jumped sides.
It's remarkable that the ordinary DNS lookup function in glibc doesn't work if the records aren't in the right order. It's amazing to me we went 20+ years without that causing more problems. My guess is most people publishing DNS records just sort of knew that the order mattered in practice, maybe figuring it out in early testing.
I think it's more of a server side ordering, in which there were not that many DNS servers out there, and the ones that didn't keep it in order quickly changed the behavior because of interop.
Now that I have seemingly taken on managing DNS at my current company I have seen several inadequacies of DNS that I was not aware of before. Main one being that if an upstream DNS server returns SERVFAIL, there is no distinction really between if the server you are querying is failed, or the actual authoritative server upstream is broken (I am aware of EDEs but doesn't really solve this). So clients querying a broken domain will retry each of their configured DNS servers, and our caching layer (Unbound) will also retry each of their upstreams etc... Results in a bunch of pointless upstream queries like an amplification attack. Also have issue with the search path doing stupid queries with NXDOMAIN like badname.company.com, badname.company.othername.com... etc..
> While in our interpretation the RFCs do not require CNAMEs to appear in any particular order, it’s clear that at least some widely-deployed DNS clients rely on it. As some systems using these clients might be updated infrequently, or never updated at all, we believe it’s best to require CNAME records to appear in-order before any other records.
Many rightfully interpret the RFC as "CNAME have to be before A", but the issue persists inbetween CNAMEs in the chain as noted in the article. If a record in the middle of the chain expires, glibc would still fail if the "middle" record was to be inserted between CNAMEs and A records.
I don't know the official process, but as a human that sometimes reads and implements IETF RFCs, I'd appreciate updates to the original doc rather than replacing it with something brand new. Probably with some dated version history.
Otherwise I might go to consult my favorite RFC and not even know its been superseded. And if it has been superseded with a brand new doc, now I have to start from scratch again instead of reading the diff or patch notes to figure out what needs updating.
And if we must supersede, I humbly request a warning be put at the top, linking the new standard.
> RFC 1034, published in 1987, defines much of the behavior of the DNS protocol, and should give us an answer on whether the order of CNAME records matters. Section 4.3.1 contains the following text:
> If recursive service is requested and available, the recursive response to a query will be one of the following:
> - The answer to the query, possibly preface by one or more CNAME RRs that specify aliases encountered on the way to an answer.
> While "possibly preface" can be interpreted as a requirement for CNAME records to appear before everything else, it does not use normative key words, such as MUST and SHOULD that modern RFCs use to express requirements. This isn’t a flaw in RFC 1034, but simply a result of its age. RFC 2119, which standardized these key words, was published in 1997, 10 years after RFC 1034.
It's pretty clear that CNAME is at the beginning.
The "possibly" does not refer to the order but rather to the presence.
After the release got reverted, it took an 1hr28min for the deployment to propagate. You'd think that would be a very long time for CloudFlare infrastructure.
We should probably all be glad that CloudFlare doesn't have the ability to update its entire global fleet any faster than 1h 28m, even if it’s a rollback operation.
Any change to a global service like that, even a rollback (or data deployment or config change), should be released to a subset of the fleet first, monitored, and then rolled out progressively.
> However, we did not have any tests asserting the behavior remains consistent due to the ambiguous language in the RFC.
Maybe I'm being overly-cynical but I have a hard time believing that they deliberately omitted a test specifically because they reviewed the RFC and found the ambiguous language. I would've expected to see some dialog with IETF beforehand if that were the case. Or some review of the behavior of common DNS clients.
It seems like an oversight, and that's totally fine.
I took it as being "we wrote the tests to the standard" and then built the code, and whoever was writing the tests didn't read that line as a testable aspect.
It's implied that they intentionally tested it that way, without any assertions on the order. Not by oversight of incompetence, but because they didn't want to bake the requirement in due to uncertainty.
Given my years of experience with Cisco "quality", I'm not surprised by this:
> Another notable affected implementation was the DNSC process in three models of Cisco ethernet switches. In the case where switches had been configured to use 1.1.1.1 these switches experienced spontaneous reboot loops when they received a response containing the reordered CNAMEs.
... but I am surprised by this:
> One such implementation that broke is the getaddrinfo function in glibc, which is commonly used on Linux for DNS resolution.
Not that glibc did anything wrong -- I'm just surprised that anyone is implementing an internet-scale caching resolver without a comprehensive test suite that includes one of the most common client implementations on the planet.
Nice analysis. Boy I can’t imagine having to work at Cloudflare on this stuff. A month to get your “small in code” change out only to find some bums somewhere have written code that will make it not work.
Or when working on massive infrastructure like this, you write plenty of tests that would have saved you a month worth of work.
They write reordering, push it and glibc tester fires, fails and you quickly discover "Crap, tests are failing and dependency (glibc) doesn't work way I thought it would."
I don't find the wording in the RFC to be that ambiguous actually.
> The answer to the query, possibly preface by one or more CNAME RRs that specify aliases encountered on the way to an answer.
The "possibly preface" (sic!) to me is obviously to be understood as "if there are any CNAME RRs, the answer to the query is to be prefaced by those CNAME RRs" and not "you can preface the query with the CNAME RRs or you can place them wherever you want".
The article makes it very clear that the ambiguity arises in another phrase: “difference in ordering of the RRs in the answer section is not significant”, which is applied to an example; the problem with examples being that they are illustrative, viz. generalisable, and thus may permit order ambiguity everywhere, and in any case, whether they should or shouldn’t becomes a matter of pragmatic context.
Which goes to show, one person’s “obvious understanding” is another’s “did they even read the entire document”.
All of which also serves to highlight the value of normative language, but that came later.
I agree with you, and I also think that their interpretation of example 6.2.1 in the RFC is somewhat nonsensical. It states that “The difference in ordering of the RRs in the answer section is not significant.” But from the RFC, very clearly this comment is relevant only to that particular example; it is comparing two responses and saying that in this case, the different ordering has no semantic effect.
And perhaps this is somewhat pedantic, but they also write that “RFC 1034 section 3.6 defines Resource Record Sets (RRsets) as collections of records with the same name, type, and class.” But looking at the RFC, it never defines such a term; it does say that within a “set” of RRs “associated with a particular name” the order doesn’t matter. But even if the RFC had said “associated with a particular combination of name, type, and class”, I don’t see how that could have introduced ambiguity. It specifies an exception to a general rule, so obviously if the exception doesn’t apply, then the general rule must be followed.
Anyway, Cloudflare probably know their DNS better than I do, but I did not find the article especially persuasive; I think the ambiguity is actually just a misreading, and that the RFC does require a particular ordering of CNAME records.
Isn't this literally noted in the article? The article even points out that the RFC is from before normative words were standardized for hard requirements.
100%
I just commented the same.
It's pretty clear that the "possibly" refers to the presence of the CNAME RRs, not the ordering.
A great example of Hyrum's Law:
"With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody."
combined with failure to follow Postel's Law:
"Be conservative in what you send, be liberal in what you accept."
Postel's law is considered more and more harmful as the industry evolved.
That depends on how Postel's law is interpreted.
What's reasonable is: "Set reserved fields to 0 when writing and ignore them when reading." (I heard that was the original example). Or "Ignore unknown JSON keys" as a modern equivalent.
What's harmful is: Accept an ill defined superset of the valid syntax and interpret it in undocumented ways.
Very much so. A better law would be conservative in both sending and accepting, as it turns out that if you are liberal in what you accept, senders will choose to disobey Postel's law and be liberal in what they send, too.
I think it is okay to accept liberally as long as you combine it with warnings for a while to give offenders a chance to fix it.
"Warnings" are like the most difficult thing to 'send' though. If an app or service doesn't outright fail, warnings can be ignored. Even if not ignored... how do you properly inform? A compiler can spit out warnings to your terminal, sure. Test-runners can log warnings. An RPC service? There's no standard I'm aware of. And DNS! Probably even worse. "Yeah, your RRs are out of order but I sorted them for you." where would you put that?
> how do you properly inform?
Through the appropriate channels; in-band and out-of-band.
Warnings are ignored. It's much better to fail fast.
The Python 3 community was famously divided on that matter, wrt Python 3. Now that it is over, most people on the "accept liberally" side of the fence have jumped sides.
It's remarkable that the ordinary DNS lookup function in glibc doesn't work if the records aren't in the right order. It's amazing to me we went 20+ years without that causing more problems. My guess is most people publishing DNS records just sort of knew that the order mattered in practice, maybe figuring it out in early testing.
I think it's more of a server side ordering, in which there were not that many DNS servers out there, and the ones that didn't keep it in order quickly changed the behavior because of interop.
CNAMES are a huge pain in the ass (as noted by DJB https://cr.yp.to/djbdns/notes.html)
It's more likely because the internet runs on a very small number of authorative server implementations which all implement this ordering quirk.
Now that I have seemingly taken on managing DNS at my current company I have seen several inadequacies of DNS that I was not aware of before. Main one being that if an upstream DNS server returns SERVFAIL, there is no distinction really between if the server you are querying is failed, or the actual authoritative server upstream is broken (I am aware of EDEs but doesn't really solve this). So clients querying a broken domain will retry each of their configured DNS servers, and our caching layer (Unbound) will also retry each of their upstreams etc... Results in a bunch of pointless upstream queries like an amplification attack. Also have issue with the search path doing stupid queries with NXDOMAIN like badname.company.com, badname.company.othername.com... etc..
I kind of wish they start sending records in randomized order to take out all the broken implementations that depend on such a fragile property
> While in our interpretation the RFCs do not require CNAMEs to appear in any particular order, it’s clear that at least some widely-deployed DNS clients rely on it. As some systems using these clients might be updated infrequently, or never updated at all, we believe it’s best to require CNAME records to appear in-order before any other records.
That's the only reasonable conclusion, really.
And I'm glad they came to it. Even if everyone else is wrong (I'm not saying they are) sometimes you just have to play along.
Many rightfully interpret the RFC as "CNAME have to be before A", but the issue persists inbetween CNAMEs in the chain as noted in the article. If a record in the middle of the chain expires, glibc would still fail if the "middle" record was to be inserted between CNAMEs and A records.
It’s always DNS.
I'm not an IETF process expert. Would this be worth filing errata against the original RFC in addition to their new proposed update?
Also, what's the right mental framework behind deciding when to release a patch RFC vs obsoleting the old standard for a comprehensive update?
I don't know the official process, but as a human that sometimes reads and implements IETF RFCs, I'd appreciate updates to the original doc rather than replacing it with something brand new. Probably with some dated version history.
Otherwise I might go to consult my favorite RFC and not even know its been superseded. And if it has been superseded with a brand new doc, now I have to start from scratch again instead of reading the diff or patch notes to figure out what needs updating.
And if we must supersede, I humbly request a warning be put at the top, linking the new standard.
> RFC 1034, published in 1987, defines much of the behavior of the DNS protocol, and should give us an answer on whether the order of CNAME records matters. Section 4.3.1 contains the following text:
> If recursive service is requested and available, the recursive response to a query will be one of the following:
> - The answer to the query, possibly preface by one or more CNAME RRs that specify aliases encountered on the way to an answer.
> While "possibly preface" can be interpreted as a requirement for CNAME records to appear before everything else, it does not use normative key words, such as MUST and SHOULD that modern RFCs use to express requirements. This isn’t a flaw in RFC 1034, but simply a result of its age. RFC 2119, which standardized these key words, was published in 1997, 10 years after RFC 1034.
It's pretty clear that CNAME is at the beginning.
The "possibly" does not refer to the order but rather to the presence.
If they are present, they are are first.
After the release got reverted, it took an 1hr28min for the deployment to propagate. You'd think that would be a very long time for CloudFlare infrastructure.
We should probably all be glad that CloudFlare doesn't have the ability to update its entire global fleet any faster than 1h 28m, even if it’s a rollback operation.
Any change to a global service like that, even a rollback (or data deployment or config change), should be released to a subset of the fleet first, monitored, and then rolled out progressively.
Given the seriousness of outages they make with instant worldwide deploys, I’m glad they took it calmly.
They had to update all the down detectors first.
> However, we did not have any tests asserting the behavior remains consistent due to the ambiguous language in the RFC.
Maybe I'm being overly-cynical but I have a hard time believing that they deliberately omitted a test specifically because they reviewed the RFC and found the ambiguous language. I would've expected to see some dialog with IETF beforehand if that were the case. Or some review of the behavior of common DNS clients.
It seems like an oversight, and that's totally fine.
I took it as being "we wrote the tests to the standard" and then built the code, and whoever was writing the tests didn't read that line as a testable aspect.
Fair enough.
My reading of that statement is their test, assuming they had one, looked something like this:
Which wouldn't have caught the ordering problem.It's implied that they intentionally tested it that way, without any assertions on the order. Not by oversight of incompetence, but because they didn't want to bake the requirement in due to uncertainty.
Given my years of experience with Cisco "quality", I'm not surprised by this:
> Another notable affected implementation was the DNSC process in three models of Cisco ethernet switches. In the case where switches had been configured to use 1.1.1.1 these switches experienced spontaneous reboot loops when they received a response containing the reordered CNAMEs.
... but I am surprised by this:
> One such implementation that broke is the getaddrinfo function in glibc, which is commonly used on Linux for DNS resolution.
Not that glibc did anything wrong -- I'm just surprised that anyone is implementing an internet-scale caching resolver without a comprehensive test suite that includes one of the most common client implementations on the planet.
Nice analysis. Boy I can’t imagine having to work at Cloudflare on this stuff. A month to get your “small in code” change out only to find some bums somewhere have written code that will make it not work.
Or when working on massive infrastructure like this, you write plenty of tests that would have saved you a month worth of work.
They write reordering, push it and glibc tester fires, fails and you quickly discover "Crap, tests are failing and dependency (glibc) doesn't work way I thought it would."
I suspect that if you could save them this time, they'd gladly pay you for it. It'll be a bit of a sell, but they seem like a fairly sensible org.
Random DNS servers and clients being broken in weird ways is such a common problem and will probably never go away unless DNS is abandoned altogether.
It's surprising how something so simple can be so broken.