内容简介:Many believe that for near-instant analytics on billions of records you'd need dedicated Linux clusters, several GPUs or proprietary Cloud offerings. Some of my fastest benchmarks were run on such environments. But in 2020, an off-the-shelf MacBook Pro usi
Many believe that for near-instant analytics on billions of records you'd need dedicated Linux clusters, several GPUs or proprietary Cloud offerings. Some of my fastest benchmarks were run on such environments. But in 2020, an off-the-shelf MacBook Pro using OmniSciDB (formerly MapD) can happily do the job.
In large enterprises, getting sign off for new clusters and bringing in expertise to keep them operational is rarely a quick process. If the value of a new dataset hasn't yet been proven then the bureaucratic hurdles can end up putting businesses off from using some of the fastest analytical offerings on the market.
To add to this, many industries need to either keep their data isolated from the Cloud, have policies to not spend budget with certain Cloud vendors, see the transferring of data too lengthy for their time-sensitive needs or impractical to shift their datasets reliably with their existing infrastructure and expertise. Only software that can run on mainstream hardware on-premise or within an enterprise's existing data centre is likely to see adoption.
It's rare to find anyone in an office environment that doesn't at least have a laptop but most don't come with top-notch Nvidia GPUs, let alone run CUDA-friendly Linux environments. This is why I was excited to hear that OmniSciDB would not only target Intel CPUs as a first-class platform but their offering would install and run in a user-friendly manor on macOS, just as any other mac software would.
OmniSciDB's GPU offering still outperforms their CPU offering by some 3.2x according to benchmarks published on their homepage but good should never be the enemy of perfect. I've always aimed to offer my clients the most practical solutions and sometimes that is a laptop proving there is value in their data.
In this post, I'm going to see how fast OmniSciDB 5.3.1 can run the 1.1 billion taxi rides benchmark using a 16" MacBook Pro running macOS. This dataset is made up of 1.1 billion taxi trips conducted in New York City between 2009 and 2015. This is the same dataset I've used to benchmark Amazon Athena, BigQuery, BrytlytDB, ClickHouse, Elasticsearch, EMR, kdb+/q, MapD, PostgreSQL, Redshift and Vertica. I have asingle-page summary of all these benchmarks for comparison.
The 16" MacBook Pro
The laptop I'll be using is a 2019, Space Grey, 16" MacBook Pro running macOS 10.15.5. The CPU is an 8-core Intel Core i9 running at 2.4 GHz. There is 64 GB of 2666 MHz DDR4 RAM and an Apple AP2048N SSD with 2 TB of capacity connected via PCIe. AmorphousDiskMark 2.5.4 was able to read from this disk at 3,454.22 MB/s and write at 3,283.94 MB/s when working with 128 KB blocks sequentially with a queue depth of 32.
The GPU won't be used by OmniSciDB in this benchmark but for the record it's an Intel UHD Graphics 630 with 1,536 MB of GPU RAM. This GPU was a $200 upgrade over the stock GPU Apple ships with this notebook. Nonetheless, it won't have a material impact on this benchmark.
This machine currently retails for about $4,000 in the US before any sales taxes.
Importing 1.1 Billion Trips Into OmniSciDB
The dataset I'll be using is a data dump I've produced of 1.1 billion taxi trips conducted in New York City over a six year period. The raw dataset lives as 56 GZIP-compressed CSV files that are 104 GB when compressed and need 500 GB of space when decompressed. The Billion Taxi Rides in Redshift blog post goes into detail regarding how I put this dataset together and describes the columns it contains in further detail.
OmniSciDB has a Java dependency so I've opted to install OpenJDK 8 LTS using the HotSpot JVM from AdoptOpenJDK .
The default number of file descriptors a process can have open on macOS is 256. OmniSciDB will likely use more depending on the size of data, number of tables and number of concurrent connections. This limit can be increased on a per-session basis but instead I'll make the upgrade permanent with the following commands.
$ curl -O https://raw.githubusercontent.com/wilsonmar/mac-setup/master/configs/limit.maxproc.plist $ curl -O https://raw.githubusercontent.com/wilsonmar/mac-setup/master/configs/limit.maxfiles.plist $ sudo cp limit.max{proc,files}.plist /Library/LaunchDaemons/ $ sudo chmod 644 /Library/LaunchDaemons/limit.max{proc,files}.plist $ sudo launchctl load -w /Library/LaunchDaemons/limit.maxproc.plist $ sudo launchctl load -w /Library/LaunchDaemons/limit.maxfiles.plist
The following proves that the file descriptors limit was increased.
$ ulimit -n
I was provided with a build of OmniSciDB 5.3.1 for macOS that I'll install via the Terminal in my home directory. I've decompressed the distributable and created a symlink so that I can address the parent folder as ~/omnisci .
$ cd ~ $ tar xvf omnisci-ee-master-Darwin-x86_64-cpu.tar.gz $ ln -sfn $(ls -dt omnisci-ee* | head -n1) ~/omnisci
I'll then create a data folder for OmniSciDB.
$ mkdir -p ~/omnisci-data
I'll then initialise the data folder and launch OmniSciDB's Server.
$ cd ~/omnisci $ bin/initdb ~/omnisci-data $ bin/omnisci_server --data ~/omnisci-data
The above server process was kept running in a Terminal. You could also run it as a daemon, in a screen session or via tmux if you wish.
I'll create an alias to the OmniSciDB client and include the default credentials. This will let me launch it by typing omnisql in the Terminal.
$ alias omnisql="~/omnisci/bin/omnisql -p HyperInteractive"
The following will create the table for the taxi trips dataset.
$ omnisql
DROP TABLE IF EXISTS trips; CREATE TABLE trips ( trip_id INTEGER, vendor_id TEXT ENCODING DICT(8), pickup_datetime TIMESTAMP ENCODING FIXED(32), dropoff_datetime TIMESTAMP ENCODING FIXED(32), store_and_fwd_flag TEXT ENCODING DICT(8), rate_code_id SMALLINT, pickup_longitude DECIMAL(14,2), pickup_latitude DECIMAL(14,2), dropoff_longitude DECIMAL(14,2), dropoff_latitude DECIMAL(14,2), passenger_count SMALLINT, trip_distance DECIMAL(14,2), fare_amount DECIMAL(14,2), extra DECIMAL(14,2), mta_tax DECIMAL(14,2), tip_amount DECIMAL(14,2), tolls_amount DECIMAL(14,2), ehail_fee DECIMAL(14,2), improvement_surcharge DECIMAL(14,2), total_amount DECIMAL(14,2), payment_type TEXT ENCODING DICT(8), trip_type SMALLINT, pickup TEXT ENCODING DICT(16), dropoff TEXT ENCODING DICT(16), cab_type TEXT ENCODING DICT(8), precipitation SMALLINT, snow_depth SMALLINT, snowfall SMALLINT, max_temperature SMALLINT, min_temperature SMALLINT, average_wind_speed SMALLINT, pickup_nyct2010_gid SMALLINT, pickup_ctlabel TEXT ENCODING DICT(16), pickup_borocode SMALLINT, pickup_boroname TEXT ENCODING DICT(8), pickup_ct2010 TEXT ENCODING DICT(16), pickup_boroct2010 TEXT ENCODING DICT(16), pickup_cdeligibil TEXT ENCODING DICT(8), pickup_ntacode TEXT ENCODING DICT(8), pickup_ntaname TEXT ENCODING DICT(8), pickup_puma TEXT ENCODING DICT(8), dropoff_nyct2010_gid SMALLINT, dropoff_ctlabel TEXT ENCODING DICT(16), dropoff_borocode SMALLINT, dropoff_boroname TEXT ENCODING DICT(8), dropoff_ct2010 TEXT ENCODING DICT(16), dropoff_boroct2010 TEXT ENCODING DICT(16), dropoff_cdeligibil TEXT ENCODING DICT(8), dropoff_ntacode TEXT ENCODING DICT(8), dropoff_ntaname TEXT ENCODING DICT(8), dropoff_puma TEXT ENCODING DICT(8) ) WITH (FRAGMENT_SIZE=75000000);
OmniSciDB can import GZIP-compressed CSV files without needing them decompressed ahead of time. It also allows wild stars / globs so the 56 CSV files don't have to be named individually. This is both a big time and disk space saver.
Below I'll run a SQL command that will import the entire dataset into OmniSciDB.
COPY trips FROM '/Users/mark/taxi_csv/*.gz' WITH (HEADER='false');
The above managed to complete in 31 minutes and 40 seconds. The resulting import produced 294 GB of data in OmniSciDB's internal format.
Benchmarking OmniSciDB
The times quoted below are the lowest query times seen during a series of runs. As with all my benchmarks, I use the lowest query time as a way of indicating "top speed".
$ omnisql
\timing
The following completed in 0.134 seconds.
SELECT cab_type, count(*) FROM trips GROUP BY cab_type;
The following completed in 0.349 seconds.
SELECT passenger_count, avg(total_amount) FROM trips GROUP BY passenger_count;
The following completed in 0.542 seconds.
SELECT passenger_count, extract(year from pickup_datetime) AS pickup_year, count(*) FROM trips GROUP BY passenger_count, pickup_year;
The following completed in 3.312 seconds.
SELECT passenger_count, extract(year from pickup_datetime) AS pickup_year, cast(trip_distance as int) AS distance, count(*) AS the_count FROM trips GROUP BY passenger_count, pickup_year, distance ORDER BY pickup_year, the_count desc;
Final Thoughts
The Q1 time is the fastest for any workstation benchmark I've done. To get this level of performance on a regular piece of office equipment is a big game changer. The laptop might seem expensive but it's a one-off purchase that can be depreciated over a few years.
There is something magical when a client points their Tableau installation at OmniSciDB and everything they throw at it appears to run instantaneously. The less friction between questions and answers means more time discovering the value of data. The more questions you ask of your data the greater the chance of discovering the unexpected.
And for the record, for those without a Tableau license to hand, OmniSciDB has a visualisation package called Immerse which also does an amazing job at near-instant visualisation on billions of rows.
To have OmniSciDB running on a regular MacBook Pro and optimised for the Intel CPUs Apple ships with is going to do amazing things for the world of analytics.
很遗憾的说,推酷将在这个月底关闭。人生海海,几度秋凉,感谢那些有你的时光。
以上就是本文的全部内容,希望本文的内容对大家的学习或者工作能带来一定的帮助,也希望大家多多支持 码农网
猜你喜欢:本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们。
程序是怎样跑起来的
[日] 矢泽久雄 / 李逢俊 / 人民邮电出版社 / 2015-4 / 39.00元
本书从计算机的内部结构开始讲起,以图配文的形式详细讲解了二进制、内存、数据压缩、源文件和可执行文件、操作系统和应用程序的关系、汇编语言、硬件控制方法等内容,目的是让读者了解从用户双击程序图标到程序开始运行之间到底发生了什么。同时专设了“如果是你,你会怎样介绍?”专栏,以小学生、老奶奶为对象讲解程序的运行原理,颇为有趣。本书图文并茂,通俗易懂,非常适合计算机爱好者及相关从业人员阅读。一起来看看 《程序是怎样跑起来的》 这本书的介绍吧!