I don't understand people who don't validate their inputs and outputs. A count of the expected values per row would've caught such a basic mistake.
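A minimal sketch of that kind of check, assuming Python's stdlib csv module and a made-up two-column export (code, name). The function name and the sample data are mine, not from the article:

```python
import csv
import io


def rows_with_wrong_width(text, expected_fields):
    """Return (line_number, row) pairs whose field count differs from expected_fields."""
    reader = csv.reader(io.StringIO(text, newline=""))
    bad = []
    for row in reader:
        if len(row) != expected_fields:
            # reader.line_num is the source line the row ended on
            bad.append((reader.line_num, row))
    return bad


# Hypothetical export where the writer forgot to quote the comma in the name.
unquoted = "MD,Moldova, Republic of\r\nNO,Norway\r\n"
quoted = 'MD,"Moldova, Republic of"\r\nNO,Norway\r\n'

problems = rows_with_wrong_width(unquoted, expected_fields=2)
no_problems = rows_with_wrong_width(quoted, expected_fields=2)
```

The unquoted row splits into three fields and gets flagged. The quoted one passes, so the check fails loudly before the load instead of silently shifting columns.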
I really don't understand why people think it's a good idea to use CSV. In English settings, the comma can be used as a thousands separator in large numbers, e.g. 1,000,000 for one million. In German, the comma is used as the decimal separator, e.g. 1,50€ for 1 euro and 50 cents. And of course, commas can appear in free-text fields. Given all that, it is just logical to use TSV instead!
Yes, but tabs can also appear in text fields. If you are free to pick something other than CSV, then perhaps consider Feather or Parquet?
RFC 4180 [1], Section 2, rule 6 says:
"Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes."
If the DMS output isn’t quoting fields that contain commas, that’s technically invalid CSV.
A small normalization step before COPY (or ensuring the writer emits RFC-compliant CSV in the first place) would make the pipeline robust without renaming countries or changing delimiters.
That way, if/when the DMS output is fixed upstream, nothing downstream needs to change.
[1] https://www.rfc-editor.org/rfc/rfc4180.html
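For what "the writer emits RFC-compliant CSV" could look like, here is a rough sketch with Python's stdlib csv module. I don't know what the actual DMS job runs, so the rows and the two-column layout are assumptions purely for illustration:

```python
import csv
import io

# Hypothetical rows: ISO code and official country name.
rows = [["MD", "Moldova, Republic of"], ["NO", "Norway"]]

# QUOTE_MINIMAL quotes only the fields that need it (delimiter, quote
# character, or line break inside the value), as RFC 4180 asks for.
buf = io.StringIO(newline="")
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
writer.writerows(rows)
encoded = buf.getvalue()

# Reading it back with a conforming parser recovers the original fields.
decoded = list(csv.reader(io.StringIO(encoded, newline="")))
```

Only the Moldova name gets wrapped in double quotes. Everything else is written as-is, so downstream consumers that already parse CSV correctly see no difference.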
This very clearly seems like a bug either in their DMS script, or in the DMS job that they don't directly control, since CSV clearly allows for escaping commas (by just quoting them). Would love to see a bug report being submitted upstream as well as part of the "fix".
CSV quoting is dialect-dependent. Honestly, you should just never use CSV for anything if you can avoid it. It's inferior to TSV (or better yet JSON/JSONL) and tends to look like it's working while actually hiding bugs like this one.
Most CSV dialects have no problem with commas inside double-quoted fields.
The "dialect dependent" part is usually about escaping double quotes, new lines and line continuations.
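To make the dialect point concrete, a small sketch using Python's stdlib csv module: the same row written with RFC-style doubled quotes and with backslash escaping. The sample row and the helper names are made up for the example:

```python
import csv
import io


def encode(row, **fmt):
    """Write one row to a string using the given csv formatting parameters."""
    buf = io.StringIO(newline="")
    csv.writer(buf, lineterminator="\n", **fmt).writerow(row)
    return buf.getvalue()


def decode(text, **fmt):
    """Parse the first row of text using the given csv formatting parameters."""
    return next(csv.reader(io.StringIO(text, newline=""), **fmt))


row = ["MD", 'the "Republic" of Moldova, officially']
backslash = {"doublequote": False, "escapechar": "\\"}

rfc_style = encode(row)                   # embedded quotes are doubled
backslash_style = encode(row, **backslash)  # embedded quotes are backslash-escaped

same_dialect_ok = decode(backslash_style, **backslash) == row
mixed_dialect_ok = decode(backslash_style) == row  # read with the default dialect
```

Both encodings wrap the comma-containing field in double quotes. Where they disagree is how the embedded quote is written, and reading the backslash variant with the default settings no longer gives back the original row.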
Not a portable format, but it is not too bad (for this use) either considering the country list is mostly static
I'd go so far as to say any implementation that doesn't conform to RFC 4180[1] is broken and should be fixed. The vast majority of implementations get this right, it's just that some that don't are so high profile it causes people to throw up their hands and give up.
[1]: https://datatracker.ietf.org/doc/html/rfc4180
Considering the scope, this could be more easily resolved by just stripping ", Republic of" from that specific string (assuming "Moldova" on its own is sufficient).
I was expecting a Markdown-related .md issue. :)
I personally would shy away from binary formats whenever possible. For my column-based files I use TSV or the pipe character as the delimiter. Even Excel accepts these files if you include "sep=|" as the first line.
That's a cool map.
It's almost certainly the result of applying AI stylization to this map https://commons.wikimedia.org/wiki/File:Europe-Moldova.svg without attribution, messing up some of the smaller details in the process.
"Sanitize at the boundary"
Ah, but what _is_ the boundary, asks Transnistria?
LoL, good one.
Did you really name your breakaway republic Sealand'); DROP TABLE Countries;--?
just use TSV instead of CSV by default
The majority of countries' official names are in this format. We just use the short forms. "Republic of ..." is the most common formal country name: https://en.wikipedia.org/wiki/List_of_sovereign_states
Sure, but why Moldova of all places? I've seen this form usually for places where there's a dispute for the short name, like Nice/Naughty Korea, Taiwan/West Taiwan, or Macedonia/entitled Greek government.
Huge skill issue. Nothing to see here.
Come on man. What are we doing here. This is not even anything interesting like Norway being interpreted as False in YAML. This is just a straightforward escaping issue.