与异构数据的工作在静态类型语言(F#)静态、类型、语言、异构

2023-09-04 01:38:45 作者:万般努力,必出人头地.

一个F#的主张是,它允许交互式脚本和数据处理/探索。我一直在玩弄F#试图获得一个感觉它的Matlab和R比较如何进行数据分析工作。显然,F#不具备这些生态系统的所有实用的功能,但我更感兴趣的是底层语言的一般优点/缺点。

One of F#'s claims is that it allows for interactive scripting and data manipulation / exploration. I've been playing around with F# trying to get a sense for how it compares with Matlab and R for data analysis work. Obviously F# does not have all practical functionality of these ecosystems, but I am more interested in the general advantages / disadvantages of the underlying language.

对于我来说,最大的变化,甚至在功能性风格,是F#是静态类型。这有一定的吸引力,而且还经常感觉像紧身衣。举例来说,我还没有找到一个方便的方法来处理异构矩形数据 - 认为数据帧中河假设我读一个CSV文件名(字符串)和重量(浮动)。通常情况下我加载数据,执行一些变换,添加变量等,然后运行分析。在R,第一部分可能是这样的:

For me the biggest change, even over the functional style, is that F# is statically typed. This has some appeal, but also often feels like a straightjacket. For instance, I have not found a convenient way to deal with heterogeneous rectangular data -- think dataframe in R. Assume I'm reading a CSV file with names (string) and weights (float). Typically I load data in, perform some transformations, add variables, etc, and then run analysis. In R, the first part might look like:

df <- read.csv('weights.csv')
df$logweight <- log(df$weight)

在F#中,目前还不清楚是什么结构,我应该用它来做到这一点。据我可以告诉我有两种选择:1)我可以定义一个类:第一,是强类型的(专家F#9.10)或2),我可以用一个异构容器例如ArrayList。一种静态类型的类似乎并不可行,因为我需要在加载数据后添加在一个特设的方式(logweight)变量。异构的容器也很不方便,因为我每次访问变量时我将需要拆箱它。在F#:

In F#, it's not clear what structure I should use to do this. As far as I can tell I have two options: 1) I can define a class first that is strongly typed (Expert F# 9.10) or 2) I can use a heterogeneous container such as ArrayList. A statically typed class doesn't seem feasible because I need to add variables in an ad-hoc manner (logweight) after loading the data. A heterogeneous container is also inconvenient because every time I access a variable I will need to unbox it. In F#:

let df = readCsv("weights.csv")
df.["logweight"] = log(double df.["weight"])

如果这是一次或两次,它可能是好的,但指定类型的每次的时候,我用一个变量似乎并不合理。我经常处理调查与被添加变量/掉线,分裂成新的子集100S,并与其他dataframes合并。

If this were once or twice, it might be okay, but specifying a type every time I use a variable doesn't seem reasonable. I often deal with surveys with 100s of variables that are added/dropped, split into new subsets and merged with other dataframes.

我缺少一些明显的第三选择?有一些乐趣和光的方式进行交互和操作异构数据?如果我需要做数据分析的.Net,我现在的感觉是,我应该使用IronPython的所有数据探查/转换/交互工作,并且只使用F#/ C#进行数字密集的地方。本质上是F#错误的工具进行快速和肮脏的异构数据的工作?

Am I missing some obvious third choice? Is there some fun and light way to interact and manipulate heterogeneous data? If I need to do data analysis on .Net, my current sense is that I should use IronPython for all the data exploration / transformation / interaction work, and only use F#/C# for numerically intensive parts. Is F# inherently the wrong tool for quick and dirty heterogeneous data work?

推荐答案

我认为有一些其他的选择。

I think that there are a few other options.

布赖恩提到的,你可以使用 操作符(?):

As Brian mentioned, you can use the (?) operator:

type dict<'a,'b> = System.Collections.Generic.Dictionary<'a,'b>

let (?) (d:dict<_,_>) key = unbox d.[key]
let (?<-) (d:dict<_,_>) key value = d.[key] <- box value

let df = new dict<string,obj>()
df?weight <- 50.
df?logWeight <- log(df?weight)

这确实使用装箱/拆箱每次访问,有时你可能需要添加类型注释:

This does use boxing/unboxing on each access, and at times you may need to add type annotations:

(* need annotation here, since we could try to unbox at any type *)
let fltVal = (df?logWeight : float)

顶级标识符

另一种可能性是,而不是动态的现有对象定义属性(F#不支持特别好),你可以只使用顶级的标识符。

Top level identifiers

Another possibility is that rather than dynamically defining properties on existing objects (which F# doesn't support particularly well), you can just use top level identifiers.

let dfLogWeight = log(dfWeight)

这有,你将几乎从来没有需要指定类型,虽然它可能会弄乱你的顶级命名空间的优势。

This has the advantage that you will almost never need to specify types, though it may clutter your top-level namespace.

这需要更多的打字和丑陋的语法,最后一个选择是创建强类型的性对象:

A final option which requires a bit more typing and uglier syntax is to create strongly typed "property objects":

type 'a property = System.Collections.Generic.Dictionary<obj,'a>

let createProp() : property<'a> = new property<'a>()
let getProp o (prop:property<'a>) : 'a = prop.[o]
let setProp o (prop:property<'a>) (value:'a) = prop.[o] <- value

let df = new obj()
let (weight : property<double>) = createProp()
let (logWeight : property<double>) = createProp()

setProp df weight 50.
setProp df logWeight (getProp df weight)
let fltVal = getProp df logWeight

此需要显式创建的每个属性(和需要的类型注释在该点),但没有类型注释将需要后。我觉得这远远超过其他选项的可读性,尽管可能定义一个操作符来替换 getProp 将缓解这一一些。

This requires each property to be explicitly created (and requires a type annotation at that point), but no type annotations would be required after that. I find this much less readable than the other options, although perhaps defining an operator to replace getProp would alleviate that somewhat.