Faster Excel Parsing in Ruby — 27 Feb 2020

栏目: IT技术 · 发布时间: 5年前

内容简介:TL;DR:When exporting data for general use, we in the industry are likely to reach for CSV files; they’re basically plain-text, but with a sort-of agreed-upon structure — well, there isFor most people, though, Excel sheets are what’s used and understood.

Faster Excel Parsing in Ruby

TL;DR: xsv was ~5 times faster than alternatives at parsing the XLSX file I benchmarked it against, and allocates the fewest objects . Meanwhile, roo allocates the least memory of benchmarked gems by a wide margin .

When exporting data for general use, we in the industry are likely to reach for CSV files; they’re basically plain-text, but with a sort-of agreed-upon structure — well, there is RFC 4180 but Wikipedia agrees that implementations are inconsistent at best.

For most people, though, Excel sheets are what’s used and understood. XLSX (aka OOXML) files have been the default file format in Microsoft Office for some years, replacing the proprietary XLS format of yore, and few will look beyond that. Excel sheets can contain considerably more complex information than CSVs, but due to their ubiquity and, perhaps, some level of ignorance, they’re commonly used to transfer simple collections of rows. Long story short: we’re tasked with offering Excel import options in our apps.

And yet, extracting simple data from XLSX files in Ruby is slow . The leading XLSX parser according to The Ruby Toolbox is rubyXL , which is not particularly fast at this task and can suck a ton of memory in the process. Now don’t get me wrong: rubyXL is awesome and can do a lot more than simply read XLSX files, but in most cases that’s all I need.

And my friend, Martijn , is in a similar situation: he just needs to parse simple user-uploaded XLSX sheets that users upload. So he wrote a gem optimized for parsing speed: xsv . I thought the idea was really interesting, so I wrote some simple benchmarks for popular gems capable of parsing these files and tabulated the results below.

The Benchmarks

Gem Parses/second Parses/hour Allocated Memory Retained Memory Allocated Objects Retained Objects
rubyXL 0.129 463 934.014M 1.973M 14.413M 20,215
simple_xlsx_reader 0.282 1,015 462.283M 610.039k 8.21M 5,383
creek 0.422 1,518 911.538M 808.277k 13.180M 6,100
roo 0.301 1,084 168.783M 1.327M 2.232M 11,058
xsv 1.531 5,510 347.250M 362.540k 1.506M 3,166

Benchmarks were run on on a 2018 15-inch MacBook Pro with a 2.6GHz 6-Core Intel Core i7, 16 GB 2400MHz memory and MacOS Catalina 10.15.3. Ruby is 2.7.0, whilst the gem versions are:

  • rubyXL 3.4.12
  • simple_xslx_reader 1.0.4
  • creek 2.5.0
  • roo 2.8.3
  • xsv 0.3.2

Spreadsheet can be found here , provided by the Dutch “Stichtse Vecht”.

I chose not to use the compare feature of benchmark-ips , instead opting to run each benchmark completely individually to ensure no side-effects.

The benchmark code can be found here . Though it’s not part of the codebase, I verified that each benchmark handles the same data by outputting the contents of each cell to a file and comparing the results via cmp .

If you find any errors, please make an issue or a PR.

The Summary

Speed-wise, the benchmarks speak for themselves: xsv is ~5 times faster than alternatives.

Regarding memory use, I’m no benchmarking expert so I’ll just quote Sam Saffron himself on memory metrics (from memory_profiler ):

Retained: long lived memory use and object count retained due to the execution of the code block.
Allocated: All object allocation and memory allocation during code block.

roo does a great job in allocated memory, and there may be room for improvement in this area for xsv. Having said that, xsv does come second in memory allocation.

Don’t forget that xsv is only there to parse sheets; for writing and more advanced functionality, you’ll still need to reach for another tool.

Suggestions to improve xsv are welcome.


以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

谁说商业直觉是天生的

谁说商业直觉是天生的

[美] 戴夫·帕特奈克 (Dev Patnaik)、[美] 彼得·莫特森 (Peter Mortensen) / 马慧 / 万卷出版公司 / 2010-07 / 36.00

《Wired to Care》是帕特奈克集近年来在创新顾问公司 Jump Associates 实务经验,与史丹佛大学教学经验之大成,虽然《Wired to Care》定位为一本用设计创新方法谈企业管理的书,但本书,活像是一本近代的设计史,从以销售为设计目标的Raymond Loewy谈起,到以人为设计中心的OXO GOOD GRIPSSwivelPeeler削皮刀。由此作者向我们揭示了企业如何运......一起来看看 《谁说商业直觉是天生的》 这本书的介绍吧!

HTML 压缩/解压工具
HTML 压缩/解压工具

在线压缩/解压 HTML 代码

CSS 压缩/解压工具
CSS 压缩/解压工具

在线压缩/解压 CSS 代码

XML、JSON 在线转换
XML、JSON 在线转换

在线XML、JSON转换工具