WSO2 BAM implements data analysis using an Apache Hadoop-based big data analytics framework. Hadoop allows BAM to scale to large data volumes, and BAM uses Apache Hive to create and execute analytic jobs. By default, Hive submits analytic jobs to a Hadoop instance running in local mode. However, you can also set up an external multi-node Hadoop cluster and point BAM to it.
Although BAM uses Apache Hadoop's MapReduce technology underneath, you do not have to write complex Hadoop jobs to process data. BAM decouples you from these underlying complexities and enables you to write data processing queries in the SQL-like Hive query language. Hive gives you the right level of abstraction from Hadoop while internally submitting the analytic jobs to Hadoop. It either spawns a Hadoop JVM internally or delegates to a Hadoop cluster.
Configuring the Hadoop cluster
Let's see how to configure the Hadoop cluster. Execute the following steps in all nodes in the BAM deployment unless otherwise specified (see here for more information on this).
- Install Java in a location that all the user groups can access. For example,
apt-get in order to copy the Hadoop configurations across all nodes.
- Create a user by the name hadoop with the command:
- Log in as user hadoop using the command:
su - hadoop
Key exchange for Passphraseless SSH
- The Hadoop nodes need passwordless (passphraseless) SSH to communicate with each other. To establish an SSH connection to another node from node1, use the command:
ssh hadoop@node2
To avoid this command requesting a password, set up the SSH key exchange among the Hadoop nodes.
- Generate a key for the name node using the command:
ssh-keygen
It creates a .ssh directory inside the home directory of the user hadoop.
- Inside the generated .ssh directory, there is a file with the public key (id_rsa.pub). Append this public key of node1 to the authorized_keys file on the other Hadoop nodes (node2 to node5). First, copy the id_rsa.pub file to the other nodes by executing the following commands:
scp id_rsa.pub hadoop@node2:/home/hadoop
scp id_rsa.pub hadoop@node3:/home/hadoop
scp id_rsa.pub hadoop@node4:/home/hadoop
scp id_rsa.pub hadoop@node5:/home/hadoop
- Log in to the hadoop user account on the second Hadoop node (node2) and establish an SSH connection to another node from it. Use the command:
ssh hadoop@node3
It creates the .ssh directory in the hadoop account.
- Append the copied public key to the authorized_keys file in the hadoop account of node2. Execute the following commands:
cat /home/hadoop/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys
chown hadoop:hadoop /home/hadoop/.ssh/authorized_keys
chmod 600 /home/hadoop/.ssh/authorized_keys
- Now you can SSH to node2 from node1 without a password prompt. Log in to the master node (node1). From the hadoop account, log in to node2 using the command:
ssh -i ~/.ssh/id_rsa hadoop@node2
If you still cannot establish an SSH connection to node2 without a password, run the following commands on node2.
- Carry out steps 4 to 6 on all the other nodes as well (node3 to node5).
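The append step above can be made safe to repeat on each node. The following is a minimal sketch, not part of the BAM or Hadoop distribution: install_key is a hypothetical helper name, and the paths in the comments are the ones used in this guide. It adds the key only if it is not already present and tightens the permissions on the .ssh directory and the authorized_keys file, since sshd may refuse keys stored with overly permissive modes.

```shell
#!/bin/sh
# Sketch: append a public key to authorized_keys only if it is not there yet.
# install_key is a hypothetical helper name, not a Hadoop or OpenSSH command.
install_key() {
    pub_file="$1"     # e.g. /home/hadoop/id_rsa.pub (copied from node1)
    auth_file="$2"    # e.g. /home/hadoop/.ssh/authorized_keys
    ssh_dir=$(dirname "$auth_file")

    # Create the directory and file if needed, with restrictive permissions.
    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir"
    touch "$auth_file" && chmod 600 "$auth_file"

    # -F: fixed strings, -x: whole-line match, -f: read patterns from the key file.
    if ! grep -qxFf "$pub_file" "$auth_file"; then
        cat "$pub_file" >> "$auth_file"
    fi
}
```

For example, running install_key /home/hadoop/id_rsa.pub /home/hadoop/.ssh/authorized_keys as the hadoop user on node2 has the same effect as the cat and chmod commands above, and running it a second time leaves the file unchanged.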
Configuring the master node
<HADOOP_HOME> refers to the path to the Hadoop installation directory throughout this guide.
Edit the <HADOOP_HOME>/conf/core-site.xml file as follows:
Edit the <HADOOP_HOME>/conf/masters file and enter the following into it:
In node1 and node2, edit <HADOOP_HOME>/conf/hadoop-policy.xml as follows. It enables write access for the hadoop user to the Hadoop nodes.
Syncing Hadoop configurations across all nodes
- Log in to the Master Hadoop node's hadoop account. From the Hadoop installation directory, execute the command below in order to propagate Hadoop configurations and binaries to node2:
rsync -a -e ssh . hadoop@node2:/home/hadoop/hadoop
- Remove the records from the slaves file on node2.
- Be sure to repeat steps 18 and 19 above on all the other nodes (node3 to node5).
- From the master node's Hadoop installation directory, execute the following command to format the namenode:
bin/hadoop namenode -format
Start the name node with the command:
sh start-all.sh
All nodes should be started simultaneously. See here to view the most common issues that you may encounter when setting up a multi-node Hadoop cluster.
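The per-node sync in this section can also be scripted instead of typed once per node. The following is a minimal sketch under stated assumptions: NODES, REMOTE_DIR and sync_cmds are illustrative names, and the node names and target directory are the ones used in this guide. It only prints the rsync commands so that you can review them before running anything.

```shell
#!/bin/sh
# Sketch: build the rsync command for every worker node.
# NODES, REMOTE_DIR and sync_cmds are illustrative names; adjust to your cluster.
NODES="node2 node3 node4 node5"
REMOTE_DIR="/home/hadoop/hadoop"

sync_cmds() {
    for node in $NODES; do
        # Same rsync invocation as in the sync step of this guide, one per node.
        echo "rsync -a -e ssh . hadoop@$node:$REMOTE_DIR"
    done
}

# Print the commands for review only; nothing is copied at this point.
sync_cmds
```

To execute the commands, run the script from the Hadoop installation directory on the master node and pipe its output to sh. Because the sync copies the configuration directory as well, the slaves file on each node would still need to be cleaned up afterwards as described in the steps above.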