Install additional Python packages on an Azure HDInsight cluster
An HDInsight cluster consists of several nodes, and a PySpark script may run on any one of them, so a package has to be installed on every node.
To install a package across the nodes of an HDInsight cluster, use a Script Action.
Use Script Action
A Script Action runs a shell script on the nodes of an HDInsight cluster. You can run it when you create the cluster or while the cluster is running.
You can also persist a Script Action in the cluster, so that it runs on every new node that is added to the cluster later.
Pip install script action
This script installs the packages listed in the parameters section, upgrading any that are already installed.
For example, you may install a new version of
Plotly to use in your
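As an illustration only, a pip install script action could look like the sketch below. It is not the original script: the function name `install_packages`, the `PIP_BIN` variable, and the `DRY_RUN` switch are hypothetical, and it assumes the packages you pass are installable with the `pip` found on each node. Script Action parameters reach the script as positional arguments.

```shell
#!/bin/sh
# Hypothetical pip-install script action (a sketch, not the original script).
# Script action parameters arrive as positional arguments: $1, $2, ...

# PIP_BIN is an assumption: point it at the pip of the Python environment
# that your PySpark jobs actually use on the cluster.
PIP_BIN="${PIP_BIN:-pip}"

install_packages() {
    if [ "$#" -eq 0 ]; then
        echo "No packages given; nothing to install." >&2
        return 0
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of running it (useful for a local check).
        echo "$PIP_BIN install --upgrade $*"
    else
        # Install the packages, upgrading any that are already present.
        "$PIP_BIN" install --upgrade "$@"
    fi
}

install_packages "$@"
```

Running the script locally with `DRY_RUN=1` and a list of package names prints the pip command it would run, without installing anything.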
Upload your Script Action
Your script must be stored in a location that the cluster can reach. Any such location works; here we use an Azure Storage Account.
Upload the script to a Storage Account that your HDInsight cluster has access to.
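The upload can also be done with the Azure CLI. The sketch below assumes you are already logged in with `az login` and have write access to the container. The account name, container name, and the helper names `script_uri` and `upload_script` are placeholders, and the URI format assumes the public Azure cloud.

```shell
#!/bin/sh
# Sketch: upload a script action to Blob storage with the Azure CLI.
# Usage: sh upload-script.sh <local-script-file>

# Placeholder names (assumptions); override them through the environment.
STORAGE_ACCOUNT="${STORAGE_ACCOUNT:-mystorageaccount}"
CONTAINER="${CONTAINER:-scripts}"

# Build the blob URI that the cluster downloads the script from.
# $1 = storage account, $2 = container, $3 = blob name
script_uri() {
    echo "https://$1.blob.core.windows.net/$2/$3"
}

# Upload the file, then print its URI if the upload succeeded.
upload_script() {
    blob_name="$(basename "$1")"
    az storage blob upload \
        --account-name "$STORAGE_ACCOUNT" \
        --container-name "$CONTAINER" \
        --name "$blob_name" \
        --file "$1" \
        --auth-mode login \
        --output none &&
    script_uri "$STORAGE_ACCOUNT" "$CONTAINER" "$blob_name"
}

# Only upload when a file is given on the command line.
if [ "$#" -gt 0 ]; then
    upload_script "$1"
fi
```

The printed URI is the value to supply as the script location when you submit the Script Action.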
Bash script URI
- In the Azure portal, select the cluster on which you want to run the Script Action
- On the cluster blade, under Configuration, find Script Actions
- Select it, then click Submit New
- Select Custom as the script type
- Give the Script Action a name
- Select the node types to install on
- In the parameters section, list the packages to install, separated by spaces
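The portal steps above can probably be scripted as well. The sketch below uses the Azure CLI command `az hdinsight script-action execute`. The cluster name, resource group, script URI, action name, and the helper names are placeholders, and it assumes the packages should go on the head and worker nodes.

```shell
#!/bin/sh
# Sketch: submit the script action from the Azure CLI instead of the portal.
# Usage: sh submit-script-action.sh <package> [<package> ...]

# Placeholder values (assumptions); override them through the environment.
CLUSTER="${CLUSTER:-my-cluster}"
RESOURCE_GROUP="${RESOURCE_GROUP:-my-resource-group}"
SCRIPT_URI="${SCRIPT_URI:-https://mystorageaccount.blob.core.windows.net/scripts/install-packages.sh}"

# Join the package names into one space-separated parameter string.
package_params() {
    echo "$*"
}

submit_script_action() {
    # --persist-on-success keeps the action so it also runs on nodes added later.
    az hdinsight script-action execute \
        --cluster-name "$CLUSTER" \
        --resource-group "$RESOURCE_GROUP" \
        --name "pip-install-packages" \
        --script-uri "$SCRIPT_URI" \
        --roles headnode workernode \
        --script-parameters "$(package_params "$@")" \
        --persist-on-success
}

# Only submit when at least one package is given on the command line.
if [ "$#" -gt 0 ]; then
    submit_script_action "$@"
fi
```

For example, `sh submit-script-action.sh plotly` would pass `plotly` as the parameter string, matching what you would type into the parameters section in the portal.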
You can also add this script to the HDInsight ARM template, as `scriptActions` on a role in the `computeProfile`.
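As a rough sketch, the relevant part of the template could look like the fragment below. The action name, URI, and package list are placeholders, and the other properties that each role needs (instance count, hardware and OS profile) are left out for brevity.

```json
{
  "computeProfile": {
    "roles": [
      {
        "name": "workernode",
        "scriptActions": [
          {
            "name": "pip-install-packages",
            "uri": "https://mystorageaccount.blob.core.windows.net/scripts/install-packages.sh",
            "parameters": "plotly"
          }
        ]
      }
    ]
  }
}
```

Repeat the `scriptActions` entry on each role (for example `headnode`) where the packages should be installed.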
NB: You can create other Script Actions to apply any kind of customisation to your HDInsight cluster.