1.1 Billion Taxi Rides using OmniSciDB (formerly MapD) and a MacBook Pro

栏目: IT技术 · 发布时间: 4年前

内容简介:Many believe that for near-instant analytics on billions of records you'd need dedicated Linux clusters, several GPUs or proprietary Cloud offerings. Some of my fastest benchmarks were run on such environments. But in 2020, an off-the-shelf MacBook Pro usi

Many believe that for near-instant analytics on billions of records you'd need dedicated Linux clusters, several GPUs or proprietary Cloud offerings. Some of my fastest benchmarks were run on such environments. But in 2020, an off-the-shelf MacBook Pro using OmniSciDB (formerly MapD) can happily do the job.

In large enterprises, getting sign off for new clusters and bringing in expertise to keep them operational is rarely a quick process. If the value of a new dataset hasn't yet been proven then the bureaucratic hurdles can end up putting businesses off from using some of the fastest analytical offerings on the market.

To add to this, many industries need to either keep their data isolated from the Cloud, have policies to not spend budget with certain Cloud vendors, see the transferring of data too lengthy for their time-sensitive needs or impractical to shift their datasets reliably with their existing infrastructure and expertise. Only software that can run on mainstream hardware on-premise or within an enterprise's existing data centre is likely to see adoption.

It's rare to find anyone in an office environment that doesn't at least have a laptop but most don't come with top-notch Nvidia GPUs, let alone run CUDA-friendly Linux environments. This is why I was excited to hear that OmniSciDB would not only target Intel CPUs as a first-class platform but their offering would install and run in a user-friendly manor on macOS, just as any other mac software would.

OmniSciDB's GPU offering still outperforms their CPU offering by some 3.2x according to benchmarks published on their homepage but good should never be the enemy of perfect. I've always aimed to offer my clients the most practical solutions and sometimes that is a laptop proving there is value in their data.

In this post, I'm going to see how fast OmniSciDB 5.3.1 can run the 1.1 billion taxi rides benchmark using a 16" MacBook Pro running macOS. This dataset is made up of 1.1 billion taxi trips conducted in New York City between 2009 and 2015. This is the same dataset I've used to benchmark Amazon Athena, BigQuery, BrytlytDB, ClickHouse, Elasticsearch, EMR, kdb+/q, MapD, PostgreSQL, Redshift and Vertica. I have asingle-page summary of all these benchmarks for comparison.

The 16" MacBook Pro

The laptop I'll be using is a 2019, Space Grey, 16" MacBook Pro running macOS 10.15.5. The CPU is an 8-core Intel Core i9 running at 2.4 GHz. There is 64 GB of 2666 MHz DDR4 RAM and an Apple AP2048N SSD with 2 TB of capacity connected via PCIe. AmorphousDiskMark 2.5.4 was able to read from this disk at 3,454.22 MB/s and write at 3,283.94 MB/s when working with 128 KB blocks sequentially with a queue depth of 32.

The GPU won't be used by OmniSciDB in this benchmark but for the record it's an Intel UHD Graphics 630 with 1,536 MB of GPU RAM. This GPU was a $200 upgrade over the stock GPU Apple ships with this notebook. Nonetheless, it won't have a material impact on this benchmark.

This machine currently retails for about $4,000 in the US before any sales taxes.

Importing 1.1 Billion Trips Into OmniSciDB

The dataset I'll be using is a data dump I've produced of 1.1 billion taxi trips conducted in New York City over a six year period. The raw dataset lives as 56 GZIP-compressed CSV files that are 104 GB when compressed and need 500 GB of space when decompressed. The Billion Taxi Rides in Redshift blog post goes into detail regarding how I put this dataset together and describes the columns it contains in further detail.

OmniSciDB has a Java dependency so I've opted to install OpenJDK 8 LTS using the HotSpot JVM from AdoptOpenJDK .

The default number of file descriptors a process can have open on macOS is 256. OmniSciDB will likely use more depending on the size of data, number of tables and number of concurrent connections. This limit can be increased on a per-session basis but instead I'll make the upgrade permanent with the following commands.

$ curl -O https://raw.githubusercontent.com/wilsonmar/mac-setup/master/configs/limit.maxproc.plist
$ curl -O https://raw.githubusercontent.com/wilsonmar/mac-setup/master/configs/limit.maxfiles.plist

$ sudo cp limit.max{proc,files}.plist /Library/LaunchDaemons/
$ sudo chmod 644 /Library/LaunchDaemons/limit.max{proc,files}.plist

$ sudo launchctl load -w /Library/LaunchDaemons/limit.maxproc.plist
$ sudo launchctl load -w /Library/LaunchDaemons/limit.maxfiles.plist

The following proves that the file descriptors limit was increased.

$ ulimit -n

I was provided with a build of OmniSciDB 5.3.1 for macOS that I'll install via the Terminal in my home directory. I've decompressed the distributable and created a symlink so that I can address the parent folder as ~/omnisci .

$ cd ~
$ tar xvf omnisci-ee-master-Darwin-x86_64-cpu.tar.gz
$ ln -sfn $(ls -dt omnisci-ee* | head -n1) ~/omnisci

I'll then create a data folder for OmniSciDB.

$ mkdir -p ~/omnisci-data

I'll then initialise the data folder and launch OmniSciDB's Server.

$ cd ~/omnisci
$ bin/initdb ~/omnisci-data
$ bin/omnisci_server --data ~/omnisci-data

The above server process was kept running in a Terminal. You could also run it as a daemon, in a screen session or via tmux if you wish.

I'll create an alias to the OmniSciDB client and include the default credentials. This will let me launch it by typing omnisql in the Terminal.

$ alias omnisql="~/omnisci/bin/omnisql -p HyperInteractive"

The following will create the table for the taxi trips dataset.

$ omnisql
DROP TABLE IF EXISTS trips;

CREATE TABLE trips (
  trip_id                 INTEGER,
  vendor_id               TEXT ENCODING DICT(8),

  pickup_datetime         TIMESTAMP ENCODING FIXED(32),

  dropoff_datetime        TIMESTAMP ENCODING FIXED(32),
  store_and_fwd_flag      TEXT ENCODING DICT(8),
  rate_code_id            SMALLINT,
  pickup_longitude        DECIMAL(14,2),
  pickup_latitude         DECIMAL(14,2),
  dropoff_longitude       DECIMAL(14,2),
  dropoff_latitude        DECIMAL(14,2),
  passenger_count         SMALLINT,
  trip_distance           DECIMAL(14,2),
  fare_amount             DECIMAL(14,2),
  extra                   DECIMAL(14,2),
  mta_tax                 DECIMAL(14,2),
  tip_amount              DECIMAL(14,2),
  tolls_amount            DECIMAL(14,2),
  ehail_fee               DECIMAL(14,2),
  improvement_surcharge   DECIMAL(14,2),
  total_amount            DECIMAL(14,2),
  payment_type            TEXT ENCODING DICT(8),
  trip_type               SMALLINT,
  pickup                  TEXT ENCODING DICT(16),
  dropoff                 TEXT ENCODING DICT(16),

  cab_type                TEXT ENCODING DICT(8),

  precipitation           SMALLINT,
  snow_depth              SMALLINT,
  snowfall                SMALLINT,
  max_temperature         SMALLINT,
  min_temperature         SMALLINT,
  average_wind_speed      SMALLINT,

  pickup_nyct2010_gid     SMALLINT,
  pickup_ctlabel          TEXT ENCODING DICT(16),
  pickup_borocode         SMALLINT,
  pickup_boroname         TEXT ENCODING DICT(8),
  pickup_ct2010           TEXT ENCODING DICT(16),
  pickup_boroct2010       TEXT ENCODING DICT(16),
  pickup_cdeligibil       TEXT ENCODING DICT(8),
  pickup_ntacode          TEXT ENCODING DICT(8),
  pickup_ntaname          TEXT ENCODING DICT(8),
  pickup_puma             TEXT ENCODING DICT(8),

  dropoff_nyct2010_gid    SMALLINT,
  dropoff_ctlabel         TEXT ENCODING DICT(16),
  dropoff_borocode        SMALLINT,
  dropoff_boroname        TEXT ENCODING DICT(8),
  dropoff_ct2010          TEXT ENCODING DICT(16),
  dropoff_boroct2010      TEXT ENCODING DICT(16),
  dropoff_cdeligibil      TEXT ENCODING DICT(8),
  dropoff_ntacode         TEXT ENCODING DICT(8),
  dropoff_ntaname         TEXT ENCODING DICT(8),
  dropoff_puma            TEXT ENCODING DICT(8)
) WITH (FRAGMENT_SIZE=75000000);

OmniSciDB can import GZIP-compressed CSV files without needing them decompressed ahead of time. It also allows wild stars / globs so the 56 CSV files don't have to be named individually. This is both a big time and disk space saver.

Below I'll run a SQL command that will import the entire dataset into OmniSciDB.

COPY trips
FROM '/Users/mark/taxi_csv/*.gz'
WITH (HEADER='false');

The above managed to complete in 31 minutes and 40 seconds. The resulting import produced 294 GB of data in OmniSciDB's internal format.

Benchmarking OmniSciDB

The times quoted below are the lowest query times seen during a series of runs. As with all my benchmarks, I use the lowest query time as a way of indicating "top speed".

$ omnisql
\timing

The following completed in 0.134 seconds.

SELECT cab_type,
       count(*)
FROM trips
GROUP BY cab_type;

The following completed in 0.349 seconds.

SELECT passenger_count,
       avg(total_amount)
FROM trips
GROUP BY passenger_count;

The following completed in 0.542 seconds.

SELECT passenger_count,
       extract(year from pickup_datetime) AS pickup_year,
       count(*)
FROM trips
GROUP BY passenger_count,
         pickup_year;

The following completed in 3.312 seconds.

SELECT passenger_count,
       extract(year from pickup_datetime) AS pickup_year,
       cast(trip_distance as int) AS distance,
       count(*) AS the_count
FROM trips
GROUP BY passenger_count,
         pickup_year,
         distance
ORDER BY pickup_year,
         the_count desc;

Final Thoughts

The Q1 time is the fastest for any workstation benchmark I've done. To get this level of performance on a regular piece of office equipment is a big game changer. The laptop might seem expensive but it's a one-off purchase that can be depreciated over a few years.

There is something magical when a client points their Tableau installation at OmniSciDB and everything they throw at it appears to run instantaneously. The less friction between questions and answers means more time discovering the value of data. The more questions you ask of your data the greater the chance of discovering the unexpected.

And for the record, for those without a Tableau license to hand, OmniSciDB has a visualisation package called Immerse which also does an amazing job at near-instant visualisation on billions of rows.

To have OmniSciDB running on a regular MacBook Pro and optimised for the Intel CPUs Apple ships with is going to do amazing things for the world of analytics.


很遗憾的说,推酷将在这个月底关闭。人生海海,几度秋凉,感谢那些有你的时光。


以上就是本文的全部内容,希望本文的内容对大家的学习或者工作能带来一定的帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

SSA:用户搜索心理与行为分析

SSA:用户搜索心理与行为分析

[美] 罗森菲尔德(Louis Rosenfeld) / 汤海、蔡复青 / 清华大学出版社 / 2014-4-1 / 59.00

何为站内搜索分析(SSA)?它如何帮助你挖掘用户搜索曰志,从中洞悉用户搜索心理和行为,从而有针对性地改善用户体验,提升网站价值?这些都可以从《SSA:用户搜索心理与行为分析》中找到答案。《SSA:用户搜索心理与行为分析》首先通过故事来说明SSA是如何使Vanguard集团起死回生的,简要介绍SSA并指导读者动手实践。其次,通过丰富的实例来介绍很多工具和方法,帮助读者着手分析用户查询数据,从中获得更......一起来看看 《SSA:用户搜索心理与行为分析》 这本书的介绍吧!

HTML 压缩/解压工具
HTML 压缩/解压工具

在线压缩/解压 HTML 代码

图片转BASE64编码
图片转BASE64编码

在线图片转Base64编码工具

URL 编码/解码
URL 编码/解码

URL 编码/解码