Archive for the ‘bigdata’ Category

select a/sum(a) from foo; -- WRONG!

will not work because sum(a) would work on each row and will return the number present in that row resulting in all 1. since sum(n) = n where n is a single number.

here is where we can make good use of  hive windowing and analytics functions.

the solution is to use the following:

select a/sum(a) over () from foo; -- RIGHT !!!

here we are instructing that the sum be performed over the entire column ‘a’.

you could of course make 2 queries and calculate the sum of a first and then hardcode the sum in another query but that means you are performing 2 passes over the table for sure. now i can’t be sure how many passes over the table are being made in the right query above. if you know the answer please post in the comments below.

Read Full Post »