
Setting up Apache Airflow on an AWS EC2 instance


Setting up Airflow on Amazon Linux was not straightforward because the default packages are outdated. For example, I could not use setuid in the Upstart config because this Amazon Linux AMI ships with Upstart 0.6.5.

AMI Version: amzn-ami-hvm-2016.09.1.20161221-x86_64-gp2 (ami-c51e3eb6)

Install gcc, python-devel, and python-setuptools

sudo yum install gcc-c++ python-devel python-setuptools

Upgrade pip

sudo pip install --upgrade pip

Install airflow using pip

sudo /usr/local/bin/pip install "airflow[s3,hive,python]"
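The extras list has to reach pip as a single argument. The sketch below uses a hypothetical helper, count_args, to show how the shell splits the requirement when it contains spaces and how quoting keeps it in one piece. It does not call pip.

```shell
# count_args is a hypothetical helper for illustration only: it prints how
# many arguments the shell passed to it.
count_args() {
  echo "$#"
}

# Spaces after the commas split the requirement into three separate words,
# so pip would receive "airflow[s3," "hive," and "python]".
unquoted=$(count_args airflow[s3, hive, python])

# Quoted and without spaces, the requirement stays a single word.
quoted=$(count_args "airflow[s3,hive,python]")

echo "unquoted: $unquoted, quoted: $quoted"
```

Quoting also stops the shell from treating the square brackets as a glob pattern.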

Create User and Group

sudo groupadd airflow
sudo useradd airflow -g airflow
sudo passwd -d airflow

This creates a passwordless user named airflow.
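To confirm the account was created as intended, a quick check like the following may help. has_primary_group is a hypothetical helper, not part of any tool; it relies only on the standard id command.

```shell
# Hypothetical helper: succeeds when user $1 exists and its primary group is $2.
has_primary_group() {
  [ "$(id -gn "$1" 2>/dev/null)" = "$2" ]
}

if has_primary_group airflow airflow; then
  echo "airflow user exists with primary group airflow"
else
  echo "airflow user or group is missing"
fi
```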

Initialize Airflow

su airflow
cd ~
airflow initdb
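airflow initdb writes its files under AIRFLOW_HOME, which defaults to ~/airflow when the variable is unset. For the airflow user that is /home/airflow/airflow, the path the Upstart configs later in this post point at. A minimal sketch of that fallback, using a hypothetical helper named airflow_home:

```shell
# Airflow falls back to ~/airflow when AIRFLOW_HOME is not set.
# airflow_home is a hypothetical helper that mirrors that rule.
airflow_home() {
  echo "${AIRFLOW_HOME:-$HOME/airflow}"
}

home_dir=$(airflow_home)
echo "Airflow home: $home_dir"
echo "Config file:  $home_dir/airflow.cfg"
```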

Test run

su airflow
cd ~
airflow webserver

You should now be able to view the Airflow UI on port 8080.
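If the page does not load, it can help to check from the instance itself whether anything is listening on the port. The sketch below assumes bash, since it uses bash's /dev/tcp redirection; port_open is a hypothetical helper.

```shell
# Hypothetical helper: succeeds when a TCP connection to host $1, port $2
# can be opened. Relies on bash's /dev/tcp redirection, so it needs bash.
port_open() {
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

if port_open 127.0.0.1 8080; then
  echo "something is listening on port 8080"
else
  echo "nothing is listening on port 8080"
fi
```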

Upstart Config for Airflow Webserver

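Now let's use Upstart to manage the Airflow process and respawn it when it dies.

This Amazon Linux AMI comes with Upstart 0.6.5, which predates the setuid and setgid stanzas (added in Upstart 1.4), so they do not work here. The configs below switch users with su instead.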
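A rough way to decide which approach a host supports is to compare its Upstart version against 1.4. The sketch below hard-codes the version from this AMI; version_ge is a hypothetical helper and assumes a sort that supports -V (GNU coreutils).

```shell
# Hypothetical helper: succeeds when version $1 is greater than or equal to $2.
# Relies on "sort -V" (version sort) from GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Version shipped with this AMI. On a live host it could be read from the
# output of "initctl --version" instead of being hard-coded.
installed="0.6.5"

if version_ge "$installed" "1.4"; then
  echo "setuid/setgid stanzas are available"
else
  echo "use su in the exec line instead"
fi
```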

/etc/init/airflow-webserver.conf

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

description "Airflow webserver daemon"

start on runlevel [2345]
stop on runlevel [016]

respawn
respawn limit 5 30

env AIRFLOW_CONFIG=/home/airflow/airflow/airflow.cfg
env AIRFLOW_HOME=/home/airflow/airflow/
export AIRFLOW_CONFIG
export AIRFLOW_HOME

pre-start script
  echo "starting airflow-webserver..." >> /var/log/airflow-webserver.log
  echo $AIRFLOW_HOME >> /var/log/airflow-webserver.log
  echo $AIRFLOW_CONFIG >> /var/log/airflow-webserver.log
end script

# exec su -s /bin/sh -c 'exec "$0" "$@"' username -- /path/to/command [parameters...]
exec su -s /bin/sh -c 'exec "$0" "$@"' airflow  -- /usr/local/bin/airflow webserver  >> /var/log/airflow-webserver.log

pre-stop script
  echo "stopping airflow-webserver" >> /var/log/airflow-webserver.log
end script
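The exec line relies on how sh -c binds its extra arguments: the first word after the command string becomes $0 and the remaining words become "$@", so the inner exec replaces the shell with the target program. Here is a small sketch of that binding, run as the current user, with printf standing in for /usr/local/bin/airflow and su left out so it runs without root:

```shell
# With "sh -c CMD NAME ARGS...", $0 is NAME and "$@" is ARGS.
# printf stands in for the airflow binary; its format string and the two
# words after it play the role of the command's parameters.
out=$(/bin/sh -c 'exec "$0" "$@"' printf '%s:%s' webserver started)
echo "$out"
```

In the job file, su runs the same command string as the airflow user, which is how the process drops root without the setuid stanza.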

You should now see airflow-webserver in the output of initctl list:

sudo initctl list

Start Airflow with Upstart

sudo initctl start airflow-webserver

The process ID is written to /home/airflow/airflow/airflow-webserver.pid.
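The PID file can be used to check whether the webserver is still alive. This is a minimal sketch with a hypothetical helper, pid_alive; run it as root or as the airflow user, because kill -0 also fails when the caller lacks permission to signal the process.

```shell
# Hypothetical helper: succeeds when the PID stored in file $1 belongs to an
# existing process. "kill -0" sends no signal; it only tests whether the
# process exists and can be signalled by the caller.
pid_alive() {
  [ -r "$1" ] || return 1
  pid=$(cat "$1")
  [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

pidfile=/home/airflow/airflow/airflow-webserver.pid
if pid_alive "$pidfile"; then
  echo "webserver is running with PID $(cat "$pidfile")"
else
  echo "webserver is not running"
fi
```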

Upstart Config for Airflow Scheduler

/etc/init/airflow-scheduler.conf

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

description "Airflow scheduler daemon"

start on started networking
stop on (deconfiguring-networking or runlevel [016])

respawn
respawn limit 5 10

env AIRFLOW_CONFIG=/home/airflow/airflow/airflow.cfg
env AIRFLOW_HOME=/home/airflow/airflow/
export AIRFLOW_CONFIG
export AIRFLOW_HOME

# required setting, 0 sets it to unlimited. Scheduler will restart after every X runs
env SCHEDULER_RUNS=5
export SCHEDULER_RUNS

# exec su -s /bin/sh -c 'exec "$0" "$@"' username -- /path/to/command [parameters...]
exec su -s /bin/sh -c 'exec "$0" "$@"' airflow  -- /usr/local/bin/airflow scheduler -n ${SCHEDULER_RUNS}  >> /var/log/airflow-scheduler.log

Start Airflow Scheduler with Upstart

sudo initctl start airflow-scheduler

This keeps the Airflow scheduler running in the background and respawns it if it fails.
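To see whether a job is up, initctl status prints a line with the job's goal and state. The sketch below pulls that field out of sample lines; job_state is a hypothetical helper, and the PID in the sample is made up.

```shell
# Hypothetical helper: extracts the goal/state field (for example
# "start/running") from one line of "initctl status" output.
job_state() {
  echo "$1" | awk '{ sub(/,$/, "", $2); print $2 }'
}

# Sample lines in the format Upstart prints; the PID is illustrative.
job_state "airflow-scheduler start/running, process 1234"
job_state "airflow-scheduler stop/waiting"
```

On the instance itself, the input would come from the real command, for example job_state "$(sudo initctl status airflow-scheduler)".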
