Comments:"Redshift: PostgreSQL-like in the cloud (benchmark) - RarestBlog"
URL:http://rarestblog.com/blog/2013/02/15/redshift-postgresql-in-the-cloud/
Amazon opened its RedShift today.
Naturally, I was curious about its performance. Basically, it's a database with a PostgreSQL-compatible protocol in the cloud.
The full test results are under the cut; here is the gist:
Insert 1M entries: local 8 seconds, RedShift 49 seconds.
Select by 2 columns (without indices): local 129ms, RedShift ~54ms
Select by 1 column (without indices): local 44ms, RedShift ~448ms
(SELECT 1 took ~100ms, so that's the round-trip time; I subtracted it from RedShift's values.)
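In code, the adjustment is trivial; a minimal sketch (the helper name and the 548ms/100ms numbers are illustrative, not from the actual test script):

```ruby
# Subtract the measured round-trip time (the cost of a bare "SELECT 1")
# from a remote query's wall-clock time.
def adjusted_time(raw_seconds, rtt_seconds)
  raw_seconds - rtt_seconds
end

# A 548ms remote select with a 100ms round trip is ~448ms of query work.
puts adjusted_time(0.548, 0.100).round(3)  # => 0.448
```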
The test is VERY SYNTHETIC. Don’t take it as a comprehensive comparison.
Of course this does not prove anything. But it seems that 1M rows is not the use case for RedShift. I did not try JOINs, maybe it’s better with JOINs.
After all, they do say it's for Business Intelligence systems. They also say "petabyte-scale". So let's try to scale up…
After some more testing, it seems that yes, >5M rows is where you start to feel the difference:
Items count: 6000000
test: local
select by shop_code and article 0.766868
select by shop_code 0.052745
test: redshift
select by shop_code and article 0.277642 (round-trip subtracted)
select by shop_code 0.179468 (round-trip subtracted)
And it's quite obvious that the complex queries run faster on RedShift.
Though RedShift's time keeps growing too:
Items count: 10000000
test:local
select by shop_code and article 1.449393
select by shop_code 0.046846
test: redshift
select by shop_code and article 0.653305
select by shop_code 0.349823
Let’s try a weird self-join, maybe that’ll help:
SELECT * FROM items i1 INNER JOIN items i2 ON i1.shop_code=i2.shop_code LIMIT 10
self-join local 0.001838
self-join remote 2.838698
What a disappointment!
And finally I’ve got something:
SELECT * FROM items i1
INNER JOIN items i2 ON i1.shop_code=i2.shop_code
WHERE i1.article='578' LIMIT 10000
self-join local 1.394531
self-join remote 3.997136
Let’s try even more stress:
SELECT * FROM items i1
INNER JOIN items i2 ON i1.shop_code=i2.shop_code
INNER JOIN items i3 ON i2.article=i3.article
WHERE i1.article='921' LIMIT 10000
self-join local 1.432942
self-join remote^C^Z
OK, I waited a couple of minutes and got bored. RedShift had no plans to respond to my nonsense.
SELECT count(*) FROM (SELECT * FROM items LIMIT 100) i1
INNER JOIN (SELECT * FROM items LIMIT 100) i2 ON i1.shop_code=i2.shop_code
WHERE i2.article='478'
self-join local 0.002615
self-join local 0.000733
self-join local 0.000586
self-join local 0.000573
self-join local 0.000542
self-join remote 25.864774
self-join remote 4.878707
self-join remote 5.463815
self-join remote 1.992176
self-join remote 1.796341
Then it hovered around 1.9 for a while. What if we kick it up a bit?
Getting weird:
SELECT count(*) FROM (SELECT * FROM items LIMIT 10000) i1
INNER JOIN (SELECT * FROM items LIMIT 10000) i2 ON i1.shop_code=i2.shop_code
WHERE i2.article='880'
self-join local 0.008671
self-join remote 1.448579
(I always run 5 tests, and if the numbers are more or less the same, I keep only one.)
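The harness for that is simple; a sketch of the repeat-and-check idea, with a pure-Ruby stub standing in for the real conn.exec call:

```ruby
require 'benchmark'

# Run the same workload 5 times and look at the spread between runs;
# the stub below is a placeholder for the actual SELECT.
def run_query
  (1..1000).reduce(:+)  # placeholder workload
end

times  = 5.times.map { Benchmark.realtime { run_query } }
spread = times.max - times.min
puts format("runs: %d, spread: %.4fs", times.size, spread)
```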
Finally RedShift is doing something better:
SELECT count(*) FROM items i1
INNER JOIN (SELECT * FROM items LIMIT 100000) i2 ON i1.shop_code=i2.shop_code
WHERE i2.article='208'
self-join local 15.323511
self-join remote 8.412499
count(*) = 67 Mil
And finally:
Items count: 10000000
SELECT count(*) FROM items i1
INNER JOIN (SELECT * FROM items LIMIT 300000) i2 ON i1.shop_code=i2.shop_code
WHERE i2.article='575'
self-join local 38.425009
self-join remote 23.762500
self-join remote 17.709055
self-join remote 14.058966
self-join remote 15.046434
self-join remote 15.029501
The count(*) is 250+Mil
So the plot looks like this (chart not reproduced here):
EDIT: vertical = seconds to run a query, horizontal = number of millions of rows in count(*), which is proportional to the number of rows analyzed; that should be about 1000 times bigger than count(*), since there were 1000 articles in the test.
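A hypothetical back-of-the-envelope check of that last count, assuming the test data's 15 shops and 1000 articles are uniformly distributed:

```ruby
# Estimate the join output for a 10M-row table joined (on shop_code) with a
# 300k-row subset, then filtered down to one of 1000 articles.
total_rows  = 10_000_000
subset_rows = 300_000
shops       = 15
articles    = 1_000

join_rows    = subset_rows * (total_rows / shops)  # rows the join produces
matched_rows = join_rows / articles                # rows left after the filter
puts matched_rows                                  # on the order of 200 million
```

That lands in the same ballpark as the 250+ Mil measured above.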
As a finale, I tried to scale the cluster to 4 smaller machines (which is x-large).
After scaling, the same query went from 15.029501 to 8.644006, but only once.
Then the cluster stopped responding:
$ ruby test.redshift.rb
pg.rb:38:in `initialize': could not connect to server: Connection timed out (PG::Error)
Is the server running on host "test1.****.us-east-1.redshift.amazonaws.com" (xxx.xxx.xxx.xxx) and accepting
In about 5 minutes it returned to normal and in fact scaled linearly (I had started a 4-node cluster):
SELECT count(*) FROM items i1
INNER JOIN (SELECT * FROM items LIMIT 300000) i2 ON i1.shop_code=i2.shop_code
WHERE i2.article='73'
self-join local 38.425009
self-join remote 3.040255
self-join remote 2.763937
self-join remote 3.071994
self-join remote 2.962063
self-join remote 2.567816
It was fun, but it was time to say goodbye!
There is also an interesting feature: snapshots. Basically, you can snapshot your data and destroy the cluster. I assume you then don't pay anything, but I might be wrong here. Later you can restore it.
The prices
0.850*24*31 = 632$ / smaller instance / month (monthly)
(2500+0.215*24*365)/12 = 365$ / smaller instance / month (prepay 1 year)
(3000+0.114*24*365*3)/12/3 = 166$ / smaller instance / month (prepay 3 years)
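Reproducing that arithmetic (rates in $/hour; I'm assuming a 31-day month for on-demand and 365-day years for the reserved terms):

```ruby
# Monthly cost per smaller instance under the three pricing options.
on_demand  = 0.850 * 24 * 31                         # pay-as-you-go month
one_year   = (2500 + 0.215 * 24 * 365) / 12          # 1-year upfront, amortized
three_year = (3000 + 0.114 * 24 * 365 * 3) / 12 / 3  # 3-year upfront, amortized
puts [on_demand, one_year, three_year].map { |x| x.round(1) }.inspect
```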
EDIT: I miscalculated the “3 year” price.
Nice things
Yes, you can connect with standard psql
You can also do many normally tedious tasks with the push of a button
This is also nice: (screenshot omitted)
Summary
My takeaway is that RedShift is mostly on par with local PostgreSQL for <5M rows, sometimes even winning. With better PostgreSQL tuning you can probably stretch that, but not as far as RedShift can go for REALLY big data.
My understanding is that you can start thinking of RedShift when you have significantly more than 10M rows to analyze.
The other big deal was that RedShift scaled linearly: I added more nodes and got proportionally faster. Using SQL.
The local setup
- PostgreSQL 9.2
- Mint 13
- default conf
- in VirtualBox on an iMac (i5, 12GB RAM)
- (read: home computer, no tuning)
Source
require 'benchmark'
require 'pg'

def test_connection(conn, name)
  puts "test: #{name}"
  begin
    conn.exec("DROP TABLE items")
  rescue
    # table may not exist yet
  end
  conn.exec("CREATE TABLE items (shop_code VARCHAR, article VARCHAR, price DECIMAL)")

  Benchmark.bm do |b|
    [10000, 100000, 1000000].each do |n|
      string = "BEGIN;"
      string += "INSERT INTO items (shop_code, article, price) VALUES "
      values = []
      n.times {
        values << "('shop#{Random.rand(15)}', '#{Random.rand(1000)}', #{Random.rand(30*1000+995)})"
      }
      string += values.join(',')
      string += "; COMMIT;"
      b.report "insert #{n} in one call" do
        conn.exec(string)
      end
    end
  end

  Benchmark.bm do |b|
    5.times do
      b.report "select by shop_code and article" do
        conn.exec("SELECT * FROM items WHERE shop_code='shop#{Random.rand(15)}' AND article='#{Random.rand(1000)}'") do |result|
          result.values
        end
      end
    end
    5.times do
      b.report "select by shop_code" do
        conn.exec("SELECT * FROM items WHERE shop_code='shop#{Random.rand(15)}' LIMIT 10000") do |result|
          result.values
        end
      end
    end
  end
end

conn_remote = PG.connect(
  dbname: 'test1',
  host: '*****.us-east-1.redshift.amazonaws.com',
  user: 'redshift',
  password: '*****',
  port: 5439,
)

conn_local = PG.connect(
  host: '127.0.0.1',
  dbname: 'redshift',
  user: 'redshift',
  password: '*****'
)

test_connection(conn_local, 'local')
test_connection(conn_remote, 'redshift')