Comments:"Redshift: PostgreSQL-like in the cloud (benchmark) - RarestBlog"
URL:http://rarestblog.com/blog/2013/02/15/redshift-postgresql-in-the-cloud/
Amazon opened its RedShift today.
Naturally, I was curious about its performance. Basically, it's a database with a PostgreSQL-compatible protocol in the cloud.
The full test results are under the cut; here is the gist:
Insert 1M entries: local 8 seconds, RedShift 49 seconds.
Select by 2 columns (without indices): local 129ms, RedShift ~54ms
Select by 1 column (without indices): local 44ms, RedShift ~448ms
(SELECT 1 took ~100ms, so that's the round-trip time; I subtracted it from RedShift's values.)
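In code, the adjustment is trivial; a minimal sketch (the helper name and the 548ms/100ms numbers are illustrative, not from the actual test script):

```ruby
# Subtract the measured round-trip time (the cost of a bare "SELECT 1")
# from a remote query's wall-clock time.
def adjusted_time(raw_seconds, rtt_seconds)
  raw_seconds - rtt_seconds
end

# A 548ms remote select with a 100ms round trip is ~448ms of query work.
puts adjusted_time(0.548, 0.100).round(3)  # => 0.448
```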
The test is VERY SYNTHETIC. Don’t take it as a comprehensive comparison.
Of course this does not prove anything. But it seems that 1M rows is not the use case for RedShift. I did not try JOINs, maybe it’s better with JOINs.
After all, they do say it's for Business Intelligence systems. They also say "petabyte-scale". So let's try to scale up…
After some more testing, it seems that yes, >5M rows is where you start to feel the difference:
Items count: 6000000
test: local
select by shop_code and article 0.766868
select by shop_code 0.052745
test: redshift
select by shop_code and article 0.277642 (round-trip subtracted)
select by shop_code 0.179468 (round-trip subtracted)
And it's quite obvious that the complex queries run faster on RedShift.
Though RedShift's time keeps growing too:
Items count: 10000000
test:local
select by shop_code and article 1.449393
select by shop_code 0.046846
test: redshift
select by shop_code and article 0.653305
select by shop_code 0.349823
Let’s try a weird self-join, maybe that’ll help:
SELECT * FROM items i1 INNER JOIN items i2 ON i1.shop_code=i2.shop_code LIMIT 10
self-join local 0.001838
self-join remote 2.838698
What a disappointment!
And finally I’ve got something:
SELECT * FROM items i1
INNER JOIN items i2 ON i1.shop_code=i2.shop_code
WHERE i1.article='578' LIMIT 10000
self-join local 1.394531
self-join remote 3.997136
Let’s try even more stress:
SELECT * FROM items i1
INNER JOIN items i2 ON i1.shop_code=i2.shop_code
INNER JOIN items i3 ON i2.article=i3.article
WHERE i1.article='921' LIMIT 10000
self-join local 1.432942
self-join remote^C^Z
OK, I waited a couple of minutes and got bored. RedShift had no plans to respond to my nonsense.
SELECT count(*) FROM (SELECT * FROM items LIMIT 100) i1
INNER JOIN (SELECT * FROM items LIMIT 100) i2 ON i1.shop_code=i2.shop_code
WHERE i2.article='478'
self-join local 0.002615
self-join local 0.000733
self-join local 0.000586
self-join local 0.000573
self-join local 0.000542
self-join remote 25.864774
self-join remote 4.878707
self-join remote 5.463815
self-join remote 1.992176
self-join remote 1.796341
Then it hovered around 1.9 for a while. What if we kick it up a bit?
Getting weird:
SELECT count(*) FROM (SELECT * FROM items LIMIT 10000) i1
INNER JOIN (SELECT * FROM items LIMIT 10000) i2 ON i1.shop_code=i2.shop_code
WHERE i2.article='880'
self-join local 0.008671
self-join remote 1.448579
(I always run 5 tests, and if the numbers are more or less the same, I keep only one.)
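The harness for that is simple; a sketch of the repeat-and-check idea, with a pure-Ruby stub standing in for the real conn.exec call:

```ruby
require 'benchmark'

# Run the same workload 5 times and look at the spread between runs;
# the stub below is a placeholder for the actual SELECT.
def run_query
  (1..1000).reduce(:+)  # placeholder workload
end

times  = 5.times.map { Benchmark.realtime { run_query } }
spread = times.max - times.min
puts format("runs: %d, spread: %.4fs", times.size, spread)
```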
Finally RedShift is doing something better:
SELECT count(*) FROM items i1
INNER JOIN (SELECT * FROM items LIMIT 100000) i2 ON i1.shop_code=i2.shop_code
WHERE i2.article='208'
self-join local 15.323511
self-join remote 8.412499
count(*) = 67 Mil
And finally:
Items count: 10000000
SELECT count(*) FROM items i1
INNER JOIN (SELECT * FROM items LIMIT 300000) i2 ON i1.shop_code=i2.shop_code
WHERE i2.article='575'
self-join local 38.425009
self-join remote 23.762500
self-join remote 17.709055
self-join remote 14.058966
self-join remote 15.046434
self-join remote 15.029501
The count(*) is 250+Mil
So the plot looks like this (chart not reproduced here):
EDIT: vertical = seconds to run a query, horizontal = number of millions of rows in count(*), which is proportional to the number of rows analyzed; that should be about 1000 times bigger than count(*), since there were 1000 articles in the test.
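A hypothetical back-of-the-envelope check of that last count, assuming the test data's 15 shops and 1000 articles are uniformly distributed:

```ruby
# Estimate the join output for a 10M-row table joined (on shop_code) with a
# 300k-row subset, then filtered down to one of 1000 articles.
total_rows  = 10_000_000
subset_rows = 300_000
shops       = 15
articles    = 1_000

join_rows    = subset_rows * (total_rows / shops)  # rows the join produces
matched_rows = join_rows / articles                # rows left after the filter
puts matched_rows                                  # on the order of 200 million
```

That lands in the same ballpark as the 250+ Mil measured above.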
As a finale, I tried to scale the cluster to 4 smaller machines (which is x-large).
After scaling, the same query went from 15.029501 to 8.644006, but only once.
Then the cluster stopped responding:
$ ruby test.redshift.rb
pg.rb:38:in `initialize': could not connect to server: Connection timed out (PG::Error)
Is the server running on host "test1.****.us-east-1.redshift.amazonaws.com" (xxx.xxx.xxx.xxx) and accepting
In about 5 minutes it returned to normal and in fact scaled linearly (I had started a 4-node cluster):
SELECT count(*) FROM items i1
INNER JOIN (SELECT * FROM items LIMIT 300000) i2 ON i1.shop_code=i2.shop_code
WHERE i2.article='73'
self-join local 38.425009
self-join remote 3.040255
self-join remote 2.763937
self-join remote 3.071994
self-join remote 2.962063
self-join remote 2.567816
It was fun, but it was time to say goodbye!
There is also an interesting feature: snapshots. Basically, you can snapshot your data and destroy the cluster. I assume you then don't pay anything, but I might be wrong here. Later you can restore it.
The prices
0.850*24*31 = 632$ / smaller instance / month (monthly)
(2500+0.215*24*365)/12 = 365$ / smaller instance / month (prepay 1 year)
(3000+0.114*24*365*3)/12/3 = 166$ / smaller instance / month (prepay 3 years)
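Reproducing that arithmetic (rates in $/hour; I'm assuming a 31-day month for on-demand and 365-day years for the reserved terms):

```ruby
# Monthly cost per smaller instance under the three pricing options.
on_demand  = 0.850 * 24 * 31                         # pay-as-you-go month
one_year   = (2500 + 0.215 * 24 * 365) / 12          # 1-year upfront, amortized
three_year = (3000 + 0.114 * 24 * 365 * 3) / 12 / 3  # 3-year upfront, amortized
puts [on_demand, one_year, three_year].map { |x| x.round(1) }.inspect
```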
EDIT: I miscalculated the “3 year” price.
Nice things
Yes, you can connect with standard psql
You can also do many normally tedious tasks with the push of a button
This is also nice: (screenshot omitted)
Summary
My takeaway is that RedShift is mostly on par with local PostgreSQL for <5M rows, sometimes even winning. With better PostgreSQL tuning you can probably stretch that, but not as far as RedShift can go for REALLY big data.
My understanding is that you can start thinking of RedShift when you have significantly more than 10M rows to analyze.
The other big deal was that RedShift scaled linearly: I added more nodes and got proportionally faster. Using SQL.
The local setup
- PostgreSQL 9.2
- Mint 13
- default conf
- in VirtualBox on an iMac (i5, 12GB RAM)
- (read: home computer, no tuning)
Source
require 'benchmark'
require 'pg'

def test_connection(conn, name)
  puts "test: #{name}"
  begin
    conn.exec("DROP TABLE items")
  rescue
    # table may not exist yet
  end
  conn.exec("CREATE TABLE items (shop_code VARCHAR, article VARCHAR, price DECIMAL)")

  Benchmark.bm do |b|
    [10000, 100000, 1000000].each do |n|
      string = "BEGIN;"
      string += "INSERT INTO items (shop_code, article, price) VALUES "
      values = []
      n.times {
        values << "('shop#{Random.rand(15)}', '#{Random.rand(1000)}', #{Random.rand(30*1000+995)})"
      }
      string += values.join(',')
      string += "; COMMIT;"
      b.report "insert #{n} in one call" do
        conn.exec(string)
      end
    end
  end

  Benchmark.bm do |b|
    5.times do
      b.report "select by shop_code and article" do
        conn.exec("SELECT * FROM items WHERE shop_code='shop#{Random.rand(15)}' AND article='#{Random.rand(1000)}'") do |result|
          result.values
        end
      end
    end
    5.times do
      b.report "select by shop_code" do
        conn.exec("SELECT * FROM items WHERE shop_code='shop#{Random.rand(15)}' LIMIT 10000") do |result|
          result.values
        end
      end
    end
  end
end

conn_remote = PG.connect(
  dbname: 'test1',
  host: '*****.us-east-1.redshift.amazonaws.com',
  user: 'redshift',
  password: '*****',
  port: 5439,
)

conn_local = PG.connect(
  host: '127.0.0.1',
  dbname: 'redshift',
  user: 'redshift',
  password: '*****'
)

test_connection(conn_local, 'local')
test_connection(conn_remote, 'redshift')