OpenShift’s scalability makes it a very interesting platform for running big data analysis. The folks at Radanalytics have done a great job compiling resources to make it simple.
Here is how it can be done.

Install your Hadoop cluster
In this article, we won’t cover how to install and run a full Hadoop cluster on OpenShift, even though that could be interesting and has probably been done already. To run our Spark cluster and let it read data from Hadoop, we need an existing Hadoop instance. A single-node one is sufficient, and it can be installed following these steps: https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/SingleCluster.html#Standalone_Operation

# Download and unpack Hadoop
wget http://apache.crihan.fr/dist/hadoop/common/current/hadoop-3.1.0.tar.gz
tar zxvf hadoop-3.1.0.tar.gz
cd hadoop-3.1.0/

# Verify the standalone installation with the bundled MapReduce example
mkdir input
cp etc/hadoop/*.xml input
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.0.jar grep input output 'dfs[a-z.]+'

# Configure a single-node HDFS (see the snippets below for the file contents)
vim etc/hadoop/core-site.xml
vim etc/hadoop/hdfs-site.xml

# Format the namenode and start HDFS
bin/hdfs namenode -format
sbin/start-dfs.sh
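
The two vim steps above apply the single-node (pseudo-distributed) configuration from the Hadoop documentation. Here is a minimal sketch of the two files, assuming Hadoop and its clients run on the same host; if not, substitute a hostname your OpenShift pods can reach:

etc/hadoop/core-site.xml:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <!-- localhost is an assumption: use a hostname reachable from your OpenShift pods -->
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

etc/hadoop/hdfs-site.xml:

<configuration>
    <property>
        <!-- a single node cannot replicate blocks, so keep replication at 1 -->
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

With HDFS running, you can load some test data for Spark to read later:

bin/hdfs dfs -mkdir -p /user/$(whoami)/input
bin/hdfs dfs -put etc/hadoop/*.xml input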

Install the Spark cluster

oc create -f https://radanalytics.io/resources.yaml
oc new-app --template=oshinko-webui
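
The first command imports the Radanalytics templates and images into the current project; the second deploys the Oshinko web UI, from which Spark clusters can be created and scaled. You can find the route to the UI with oc get routes.

Once a cluster is up, a quick way to check that Spark can read from Hadoop is a small PySpark job. This is only a sketch: the HDFS host, port, user and path are placeholders matching the single-node setup above, not values prescribed by Oshinko, and it assumes the job runs against the Spark master that Oshinko configured.

from pyspark.sql import SparkSession

# Build a session against the Spark master set up by Oshinko
spark = SparkSession.builder.appName("hdfs-read-test").getOrCreate()

# hadoop-host and <your-user> are placeholders: use a hostname resolvable from
# the Spark pods, matching the fs.defaultFS value set in core-site.xml above
df = spark.read.text("hdfs://hadoop-host:9000/user/<your-user>/input/core-site.xml")
df.show(5, truncate=False)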