Rust Dataframe: Update 1

Rust Dataframe Updates

Update 01: 04-04-2020

I started working on an experimental Rust Dataframe in January 2019, but due to work commitments, I've been unable to work on it regularly. The project is still in its infancy, and is not abandoned. I'm writing this update mostly for people who might be interested in helping out, or in following along with the experiment.

I haven't had a chance to post updates on the experiment, but now that we're home-bound due to the lockdown in South Africa, this is a good opportunity to do so.

Overall Goal

In the past 3 years, most of my work has involved bringing my own infra, which is often my laptop. I use Apache Spark (PySpark) and Pandas regularly, and I also write a lot of Rust. I started thinking about what a non-distributed dataframe library could look like in Rust, mainly to be my daily driver for my kind of work.

My current goals are to create a dataframe library that:

  • can work on my laptop, and be as fast as Spark or Dask
  • I can use in Rust and in Python (and other languages) for exploratory purposes

Design

Memory Format

The dataframe is built on top of Apache Arrow, which is an in-memory columnar data format. There are ample resources online that explain what Arrow is and does, so allow me not to digress.

Arrow is optimised to avoid common performance pitfalls like data (de)serialisation, and data is laid out in a compute-optimised structure. Using Arrow also makes it easier for us to interoperate with other libraries in future (e.g. converting to/from Pandas dataframes, writing UDFs in Python/JavaScript, etc).

An Arrow equivalent of a table is the RecordBatch. A RecordBatch has columns made up of equal-length Arrays, and a schema describing those columns.

#[derive(Clone)]
pub struct RecordBatch {
    schema: Arc<Schema>,
    columns: Vec<Arc<Array>>,
}
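
For illustration, here is a minimal sketch of constructing a RecordBatch with the Rust arrow crate. The column names and values are made up, and the module paths assume a version of the crate from around this time:

use std::sync::Arc;

use arrow::array::{Float64Array, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> arrow::error::Result<()> {
    // a schema describing two columns
    let schema = Schema::new(vec![
        Field::new("city_id", DataType::Int64, false),
        Field::new("population_millions", DataType::Float64, true),
    ]);

    // equal-length arrays, one per column
    let ids = Int64Array::from(vec![1, 2, 3]);
    let populations = Float64Array::from(vec![Some(4.0), None, Some(0.5)]);

    // a RecordBatch ties the arrays to the schema
    let batch = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(ids), Arc::new(populations)])?;
    assert_eq!(batch.num_rows(), 3);
    assert_eq!(batch.num_columns(), 2);
    Ok(())
}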

In turn, our Dataframe is a collection of RecordBatches, broken down into Columns which contain chunks of arrays. I have adapted this from the Arrow C++ library, as this structure makes it easier to operate on a single column across all record batches without needing to iterate over the batches of the dataframe.
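
A rough sketch of that layout (illustrative only; these are not the crate's actual definitions):

use std::sync::Arc;

use arrow::array::ArrayRef;
use arrow::datatypes::{Field, Schema};

// illustrative only: one logical column stored as chunks of arrays,
// one chunk per underlying RecordBatch
pub struct Column {
    field: Field,
    chunks: Vec<ArrayRef>,
}

// illustrative only: a dataframe as a schema plus its columns, so an operation
// on one column can walk that column's chunks without touching the other columns
pub struct DataFrame {
    schema: Arc<Schema>,
    columns: Vec<Column>,
}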

IO Capabilities

Arrow has decent IO support to get one started, and the Rust implementation supports:

Data Source   Read            Write
CSV           yes             yes
JSON          yes (limited)   no
Parquet       yes             no (in progress)
Arrow IPC*    yes             yes

*Arrow IPC (Inter-Process Communication) is a streaming and file format that allows different Arrow implementations to share data with each other.
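
As a small example of the CSV support, here is a hedged sketch using the arrow crate's csv::ReaderBuilder; the builder methods shown assume the API around this time and may have shifted between versions. The file path is made up:

use std::fs::File;

use arrow::csv::ReaderBuilder;

fn main() -> arrow::error::Result<()> {
    // hypothetical input file
    let file = File::open("cities.csv").expect("file should exist");

    // infer column types from the first 100 records (assumed builder API)
    let reader = ReaderBuilder::new()
        .has_header(true)
        .infer_schema(Some(100))
        .build(file)?;

    // the reader yields Arrow RecordBatches, the same unit the dataframe is built on
    for batch in reader {
        let batch = batch?;
        println!("read {} rows x {} columns", batch.num_rows(), batch.num_columns());
    }
    Ok(())
}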

When more IO formats/connections are supported, we get to benefit from them. I've also spent a bit of time working on SQL support (with PostgreSQL as a start). I recently completed a PostgreSQL binary reader, and I plan on benchmarking it against a row-based approach. If it's performant enough, I'd like to split it out into its own crate/library, and add more SQL variants.

I recently needed to extract data from MongoDB to CSV, so I wrote a stand-alone MongoDB connector. The connector uses the aggregation framework to pull batches of documents, which it then converts to Arrow data. The Arrow data can then be converted to other formats which Arrow supports, such as Parquet or CSV. I haven't added schema inference and write support, but they're on my TODO list. If I end up not using it on the dataframe, other people could still benefit from such a library.

Compute

The Rust implementation has some basic compute support, and there's also DataFusion, a SQL query engine built on top of Arrow. In our dataframe, we will leverage Arrow's compute capabilities as much as possible, which means contributing compute functions (kernels) upstream to Arrow if the greater community would benefit from them.

Arrow currently supports a lot of the primitive building blocks, such as:

  • some aggregations
  • primitive arithmetic (add, mul, div, sub)
  • boolean comparisons (and, or, not)
  • casting between compatible data types
  • comparisons (gt, lt, eq, neq, etc.)
  • filtering, limits (slicing)
  • sorting (I'm still working on this)

These building blocks allow one to perform a lot of compute when stitched together. We still have a long way to go though, as we don't yet support transformations like joins and some complex aggregates (DataFusion has aggregate support; here I mean in Arrow itself).
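
As a rough illustration of stitching these kernels together (module paths and signatures assume the arrow Rust crate of the time), the sketch below adds two columns, compares the result against a threshold, and filters:

use arrow::array::Int64Array;
use arrow::compute::kernels::{arithmetic, comparison, filter};

fn main() -> arrow::error::Result<()> {
    let a = Int64Array::from(vec![1, 2, 3, 4]);
    let b = Int64Array::from(vec![10, 20, 30, 40]);

    // primitive arithmetic: element-wise a + b
    let sums = arithmetic::add(&a, &b)?;

    // comparison: which sums are greater than 25?
    let threshold = Int64Array::from(vec![25, 25, 25, 25]);
    let mask = comparison::gt(&sums, &threshold)?;

    // filtering: keep only the rows where the mask is true
    let kept = filter::filter(&sums, &mask)?;
    assert_eq!(kept.len(), 2); // 33 and 44 survive the filter
    Ok(())
}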

My goals for compute in the dataframe are to support most of what Apache Spark supports. It might not be all functions, but I want to cover the various categories (scalar, array functions, aggregations and window functions). I haven't thought of how user-defined functions could work, but Andy Grove recently submitted a pull request for UDF support in DataFusion, and there is ongoing discussion of whether this could live in Arrow.

Lazy vs Eager Evaluation

When I started the experiment, I wanted to create a POC to see if it's feasible to create an Arrow-backed dataframe in Rust. This used eager evaluation (expressions were calculated immediately, without any planning or optimisation). After that, I started exploring how to support lazy evaluation.

The repository is currently an in-flux mess because of that, but I see light at the end of the tunnel.

The idea is to buffer operations, and only perform calculations when we need to display or save data (the same way deferred-computation libraries work). This is the really fun part because of the possibilities of optimising computations.

I recently started a draft of this optimisation, but I'll write more about this in a separate update.

There is an abstraction (so far called a 'LazyFrame' lol) which represents an output dataframe and the expressions being applied to it.

#[derive(Serialize, Deserialize, Debug, Clone)]
pub struct LazyFrame {
    id: String,
    pub(crate) expression: Expression,
    output: Dataset,
}

Expressions are the types of data operations being performed, and we currently have 3:

pub enum Expression {
    Read(Computation),
    Compute(Box<Expression>, Computation),
    Join(Box<Expression>, Box<Expression>, JoinCriteria, Dataset),
}

Expression::Compute in turn handles most of the computations, and has the inputs, transformations, and the output dataframe structure:

#[derive(Serialize, Deserialize, Debug, Clone)]
pub struct Computation {
    pub(crate) input: Vec<Dataset>,
    pub(crate) transformations: Vec<Transformation>,
    pub(crate) output: Dataset,
}
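
To make the deferred flavour concrete, here is a deliberately simplified, self-contained model; the names and shapes are illustrative and are not the repo's actual Expression/Computation types. A pipeline is just nested expressions until something forces evaluation:

// illustrative only: a simplified model of nesting deferred operations
#[derive(Debug, Clone)]
enum Expr {
    // read a named source
    Read(String),
    // apply a named transformation to the result of an inner expression
    Compute(Box<Expr>, String),
}

fn main() {
    // "read a CSV, filter it, then project two columns" - nothing is computed yet,
    // we have only described the pipeline
    let plan = Expr::Compute(
        Box::new(Expr::Compute(
            Box::new(Expr::Read("cities.csv".to_string())),
            "filter: population > 1000000".to_string(),
        )),
        "select: city, population".to_string(),
    );

    // evaluation (and any optimisation) only happens when the user asks to
    // display or save the output; here we just print the plan
    println!("{:?}", plan);
}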

There's still a bit of duplication here and there, but I am cleaning up the interface as I go along. Everything's also quite verbose under the hood, but I plan on creating a simplified higher-level API. In some of the earlier discussions around a Rust dataframe, there was a bit of opposition to stringy APIs, but they're very convenient. It's better to type dataframe.filter('a > 3') than to try to construct that using the building blocks of the API.

I ultimately plan on making lazy evaluation the default.

How will bindings work?

Rust doesn't have a runtime, and lazily evaluated compute tends to be reproducible without maintaining state. I'm thinking that the Rust library would be stateless, and a Python/JS binding would keep track of the LazyFrame and its expressions (such as "read this SQL table, then perform these computations"); transformations would then be invoked when we need to display or save data.

This would make the dataframe usable in Jupyter notebooks and equivalents.

Performance

The short answer is "I don't know how it performs yet". I'll write benchmarks in the coming weeks/months, especially as we support more functionality.

Can the Library be Used?

If you'd like to experiment, then yes; otherwise there are Rust-based alternatives that will be more stable and move at a faster pace. Check out:

Other Interesting Tidbits

Arrow has a subproject called Gandiva, an LLVM-based analytical expression compiler (see the intro blog post). The summary of it is that it allows more optimised compute on Arrow data, as opposed to hand-rolling optimisations.

It's written in C++ but there's interest in creating Rust bindings, so perhaps we could use it for compute in future.

