CSV parsing with embedded double quotes

2023-09-11 04:58:05 Author: 木槿花开

I've written a simple CSV file parser. After reading the wiki page on the CSV format, however, I noticed some "extensions" to the basic format, specifically commas embedded in a field by wrapping the field in double quotes. I've managed to parse those, but there is a second issue: embedded double quotes.

For example:

12345,"ABC, ""IJK"" XYZ" -> [12345] and [ABC, "IJK" XYZ]

I can't seem to find a correct way to tell an embedded (doubled) double quote apart from the quote that closes the field. So my question is: what is the correct way/algorithm to parse CSV input such as the line above?

Recommended answer

The way I normally think about this is to treat a value as either a single unquoted value, or a sequence of double-quoted segments that are joined by quote characters to form the value. That is,

to parse the next atom in the row:
    read up to the first non-whitespace character
    if the current character is not a quote:
        mark the current spot
        read up to the next comma or newline
        return the text between the mark and the character before the comma (strip spaces if appropriate)
    create an empty string buffer
    while the current character is a quote:
        mark the current position + 1 (skip the quote character)
        read up to the next quote
        if this is not the first segment, append a quote to the buffer
        append to the buffer the text between the mark and the character before the current position (to strip both quotes)
        advance one character (past the quote just read)
    return the buffer
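The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a complete CSV reader: the function names `parse_atom` and `parse_row` are made up for this example, whitespace around unquoted fields is stripped (the "if appropriate" choice above), and any text that follows a closing quote before the next comma is simply ignored.

```python
def parse_atom(row: str, pos: int = 0) -> tuple[str, int]:
    """Parse one field starting at pos; return (value, index just past the field)."""
    n = len(row)
    # Read up to the first non-whitespace character (newlines end the row, so keep them).
    while pos < n and row[pos] in " \t":
        pos += 1
    # Unquoted field: take everything up to the next comma or newline.
    if pos >= n or row[pos] != '"':
        mark = pos
        while pos < n and row[pos] not in ",\r\n":
            pos += 1
        return row[mark:pos].strip(), pos
    # Quoted field: a run of "..." segments sitting directly next to each other.
    segments: list[str] = []
    while pos < n and row[pos] == '"':
        mark = pos + 1                  # skip the opening quote of this segment
        end = row.find('"', mark)       # read up to the next quote
        if end == -1:
            raise ValueError("unterminated quoted field")
        if segments:                    # not the first segment: restore one literal quote
            segments.append('"')
        segments.append(row[mark:end])  # text between the two quotes
        pos = end + 1                   # advance past the quote just read
    return "".join(segments), pos


def parse_row(row: str) -> list[str]:
    """Split a single CSV row into fields using parse_atom."""
    fields = []
    pos = 0
    while True:
        value, pos = parse_atom(row, pos)
        fields.append(value)
        # Skip spaces between a closing quote and the separator.
        while pos < len(row) and row[pos] in " \t":
            pos += 1
        if pos < len(row) and row[pos] == ",":
            pos += 1                    # step over the comma and parse the next field
            continue
        return fields


print(parse_row('12345,"ABC, ""IJK"" XYZ"'))
```

Collecting the segments in a list, rather than testing whether a string buffer is empty, keeps a field such as `""""` (a single literal quote) working, because the first segment there is an empty string.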

Essentially, split the quoted string into its double-quoted segments and then concatenate them with a quote between each pair. Thus "ABC, ""IJK"" XYZ" becomes the segments "ABC, ", "IJK" and " XYZ", which in turn become ABC, "IJK" XYZ.
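Once a quoted field has been isolated, the same split-and-rejoin idea can be written very compactly. The sketch below assumes the raw field text, including its outer quotes, has already been cut out of the row; `unquote_field` is a hypothetical helper name, not part of any library.

```python
def unquote_field(raw: str) -> str:
    """Turn a raw quoted CSV field (outer quotes included) into its value."""
    if len(raw) < 2 or raw[0] != '"' or raw[-1] != '"':
        raise ValueError("expected a field wrapped in double quotes")
    inner = raw[1:-1]              # drop the enclosing quotes
    segments = inner.split('""')   # each doubled quote separates two segments
    return '"'.join(segments)      # rejoin with a single literal quote


print(unquote_field('"ABC, ""IJK"" XYZ"'))
```

Here the intermediate `segments` list is `['ABC, ', 'IJK', ' XYZ']`, which matches the segments described above.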