Apache sqoop

Sqoop

Posted on Updated on

Apache Sqoop is part of Hadoop Eco System used for data migration. Using Sqoop we can move data from/to RDBMS to/from HDFS or Hive or HBase.

Starting with Sqoop:

http://www.sqoop.apache.org is the official website for downloading a stable Sqoop version.
After downloading, untar it in your Ubuntu setup where Hadoop is installed.

tar xzf

Edit .bashrc in your teminal and update the PATH variable

open .bashrc using ‘vi .bashrc’ , go to the end of the file and enter below lines:

export SQOOP_PREFIX = enter sqoop dir path here
export PATH=$PATH:$SQOOP_PREFIX/bin

Yes, you are ready to use Sqoop now. Close the terminal and open a fresh terminal.

Sqoop commands:

Type ‘Sqoop help’ in the terminal to know the commands available with sqoop. Sqoop import and export are the common commands used for exporting and importing data in and out of Hadoop.

We need database specific JDBC drivers to use sqoop import/export. For Instance, to import/export data from PostgreSQL we need to place postgres JDBC jar under /lib folder.

PostgreSQL jdbc jar could be “postgresql-9.4-1201.jdbc4.jar”

MySQL jdbc jar could be “mysql-connector-java-5.1.30-bin.jar”.

Most database vendors provide JDBC drivers to connect to database. We need to download it from their websites and place it under sqoop/lib dir.

Typical import command for importing data from database to HDFS would be :

postgres:

sqoop import –connect jdbc:postgresql://:5432/postgres –username postgres –password –table < table name >

MySql:

sqoop import -connect jdbc:mysql://:3306/ -username -password –table < table name >
–target-dir < enter target directory >

Note:  By default, 4 Map task would be created to speed up the process. We can pass an argument -m to reduce or increase the jobs.The output files by default will use comma as the delimiter.

Export from HDFS to database:

sqoop export –connect jdbc:mysql://:3306/ -m 1 \
–table < table name >
–export-dir < enter export directory >