OpenShift’s scalability makes it a very interesting platform for running big data analysis. The folks at Radanalytics have done a great job compiling resources that make it simple.
Here is how it can be done.
Install your Hadoop cluster
In this article, we won’t cover how to install and run a Hadoop cluster on OpenShift, even though this could be interesting and has probably been done already. To run our Spark cluster and let it read data from Hadoop, we need an existing Hadoop instance. A standalone one is sufficient, and it can be installed by following these steps: https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/SingleCluster.html#Standalone_Operation
wget http://apache.crihan.fr/dist/hadoop/common/current/hadoop-3.1.0.tar.gz
tar zxvf hadoop-3.1.0.tar.gz
cd hadoop-3.1.0/
mkdir input
cp etc/hadoop/*.xml input
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.0.jar grep input output 'dfs[a-z.]+'
vim etc/hadoop/core-site.xml
vim etc/hadoop/hdfs-site.xml
bin/hdfs namenode -format
sbin/start-dfs.sh
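The two vim commands above edit the HDFS configuration. A minimal pseudo-distributed setup, as described in the linked Hadoop guide, looks like this (hdfs://localhost:9000 is the conventional single-node namenode address; adjust it if your namenode listens elsewhere):

```xml
<!-- etc/hadoop/core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- etc/hadoop/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```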
Install the Spark cluster
oc create -f https://radanalytics.io/resources.yaml
oc new-app --template=oshinko-webui
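Once the Oshinko web UI is up and a Spark cluster has been created from it, a job can read directly from the HDFS instance we started earlier. A minimal PySpark sketch follows; the namenode host, port, and file path are assumptions to adapt to your environment (use the fs.defaultFS value from your core-site.xml):

```python
# Sketch of a PySpark job reading a file from HDFS.
# "namenode-host", port 9000, and the file path are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-from-hdfs").getOrCreate()

# Read a text file straight from HDFS and count its lines.
lines = spark.read.text("hdfs://namenode-host:9000/user/hadoop/input/core-site.xml")
print(lines.count())

spark.stop()
```

Submitting this against the Oshinko-managed cluster confirms that Spark on OpenShift can reach data living in Hadoop.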