I just found a minimal example of how easy it is to confuse R’s CSV parser when providing it with ill-formatted data. To make it easy to understand, I put material for reproducing the problem up on GitHub.
I’m sure one could construct many similar examples for Python and Julia. The problem here is two-fold: (1) the CSV format is just kind of awful and (2) many end-users complain when CSV parsers don’t “successfully” interpret the mis-formatted files they want to parse. To borrow language from the web, essentially all CSV parsers have been compelled to operate in quirks mode all the time.
As far as I can see, there’s very little that would need to be changed to create a less awful CSV format. Here’s some minimal changes that could be used to define a new CSV++ format that would be far easier to parse and validate rigorously:
- Every CSV++ file starts with a short preamble.
- The first section of the preamble indicates the number of rows and columns.
- The second section of the preamble indicates the names and types of all of the columns.
- The third section of the preamble indicates the special sequence that’s used to represent NULL values.
- The fourth section of the preamble indicates (a) the separator sequence that’s used to separate columns within each row and (b) the separator sequence that’s used to separate rows.
- String values are always quoted. Values of other types are never quoted.
- CSV++ formatted files must never contain comments or any other non-content-carrying bits.
From my perspective, the only problem with this proposed format is that it assumes CSV++ files are written in batch mode because you need to know the number of rows before you can start writing data. I don’t really think this is so burdensome; in a streaming context, you could just write out files of length N and then do some last-minute patching to the final file that you wrote to correct the inaccurate number of rows that you indicated in the preamble during the initial writing phase.
I have little-to-no hope that such a format would ever become popular, but I think it’s worth documenting just how little work would be required to replace CSV files with a format that’s far easier for machines to parse while only being slightly harder for humans to write. When I think about the trade-off between human and machine needs in computing, the CSV format is my canonical example of something that ignored the machine needs so systematically that it became hard for humans to use without labor-intensive manual data validation.
Update: Ivar Nesje notes that the preamble also needs to indicate the quotation character for quoting strings and specify the rule for escaping that character within strings.