Excellent post on understanding Joins with MapReduce
I have been reading about the join implementations available for Hadoop for the past few days. In this post I recap some techniques I learnt during the process. Joins can be done on either the map side or the reduce side, depending on the nature of the data sets to be joined.
Reduce Side Join
Let’s take the following tables containing employee and department data.
Let’s see how the join query below can be achieved using a reduce side join.
The map side is responsible for emitting the join predicate values along with the corresponding record from each table, so that records with the same department id in both tables end up at the same reducer, which then joins them. However, each record must also be tagged to indicate which table it originated from, so that the join happens between records of the two tables. Following…
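To make the tagging idea concrete, here is a minimal sketch that simulates the map, shuffle and reduce phases in plain Python rather than with the Hadoop API. The sample rows, the "EMP"/"DEPT" tags and all function names are my own assumptions for illustration; they are not the tables or code from the original post.

```python
from collections import defaultdict

# Hypothetical sample data (NOT the tables from the original post).
EMPLOYEES = [("Alice", 1), ("Bob", 2), ("Carol", 1)]               # (name, dept_id)
DEPARTMENTS = [(1, "Engineering"), (2, "Sales"), (3, "Support")]   # (dept_id, dept_name)


def map_employee(record):
    """Emit the join key (department id) with a tagged employee record."""
    name, dept_id = record
    return dept_id, ("EMP", name)


def map_department(record):
    """Emit the join key (department id) with a tagged department record."""
    dept_id, dept_name = record
    return dept_id, ("DEPT", dept_name)


def shuffle(pairs):
    """Group mapped values by key, standing in for the framework's shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(dept_id, tagged_values):
    """Use the tags to split the values by source table, then pair them up."""
    employees = [v for tag, v in tagged_values if tag == "EMP"]
    departments = [v for tag, v in tagged_values if tag == "DEPT"]
    return [(name, dept_id, dept_name)
            for name in employees
            for dept_name in departments]


def reduce_side_join(employees, departments):
    """Run map, shuffle and reduce in sequence and collect the joined rows."""
    mapped = [map_employee(r) for r in employees]
    mapped += [map_department(r) for r in departments]
    joined = []
    for dept_id, values in sorted(shuffle(mapped).items()):
        joined.extend(reduce_join(dept_id, values))
    return joined


if __name__ == "__main__":
    for row in reduce_side_join(EMPLOYEES, DEPARTMENTS):
        print(row)
```

Without the tags, the reducer would receive a mixed list of values for each department id and could not tell employee records from department records. In this sketch, a department with no employees produces no output rows, so it behaves like an inner join.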