Contents

Setting up Apache Airflow on AWS EC2 instance

Setting up Airflow on AWS Linux was not direct, because of outdated default packages. For example I had trouble using setuid in Upstart config, because AWS Linux AMI came with 0.6.5 version of Upstart.

AMI Version: amzn-ami-hvm-2016.09.1.20161221-x86_64-gp2 (ami-c51e3eb6)

Install gcc, python-devel, and python-setuptools

1
sudo yum install gcc-c++ python-devel python-setuptools

Upgrade pip

1
sudo pip install --upgrade pip

Install airflow using pip

1
sudo /usr/local/bin/pip install airflow[s3, hive, python]

Create User and Group

1
2
3
sudo groupadd airflow
sudo useradd airflow -g airflow
sudo passwd -d airflow

This will create a password less user airflow

Initialize Airflow

1
2
3
su airflow
cd ~
airflow initdb

Test run

1
2
3
su airflow
cd ~
airflow webserver

You should be able to view Airflow ui at port 8080

Upstart Config for Airflow Webserver

Now let’s use upstart to manage Airflow process and respawning

This Amazon Linux AMI comes with Upstart 0.6.5, which is very sad. So setuid and setgid doesnot work.

airflow-webserver.conf

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

description "Airflow webserver daemon"

start on runlevel [2345]
stop on runlevel [016]

respawn
respawn limit 5 30

env AIRFLOW_CONFIG=/home/airflow/airflow/airflow.cfg
env AIRFLOW_HOME=/home/airflow/airflow/
export AIRFLOW_CONFIG
export AIRFLOW_HOME

pre-start script
  echo "starting airflow-webserver..." >> /var/log/airflow-webserver.log
  echo $AIRFLOW_HOME >> /var/log/airflow-webserver.log
  echo $AIRFLOW_CONFIG >> /var/log/airflow-webserver.log
end script

# exec su -s /bin/sh -c 'exec "$0" "$@"' username -- /path/to/command [parameters...]
exec su -s /bin/sh -c 'exec "$0" "$@"' airflow  -- /usr/local/bin/airflow webserver  >> /var/log/airflow-webserver.log

pre-stop script
  echo "stopping airflow-webserver" >> /var/log/airflow-webserver.log
end script

You should be able to view airflow-webserver in initctl list

1
sudo initctl list

Start Airflow with upstart

1
sudo initctl start airflow-webserver

You can find the process id at /home/airflow/airflow/airflow-webserver.pid

Upstart Config for Airflow Scheduler

airflow-scheduler.conf

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

description "Airflow scheduler daemon"

start on started networking
stop on (deconfiguring-networking or runlevel [016])

respawn
respawn limit 5 10

env AIRFLOW_CONFIG=/home/airflow/airflow/airflow.cfg
env AIRFLOW_HOME=/home/airflow/airflow/
export AIRFLOW_CONFIG
export AIRFLOW_HOME

# required setting, 0 sets it to unlimited. Scheduler will restart after every X runs
env SCHEDULER_RUNS=5
export SCHEDULER_RUNS

# exec su -s /bin/sh -c 'exec "$0" "$@"' username -- /path/to/command [parameters...]
exec su -s /bin/sh -c 'exec "$0" "$@"' airflow  -- /usr/local/bin/airflow scheduler -n ${SCHEDULER_RUNS}  >> /var/log/airflow-scheduler.log

Start Airflow Scheduler with upstart

1
sudo initctl start airflow-scheduler

This should keep Airflow Scheduler running in the background and respawn it in case of failures.

References