Setting up Airflow on Amazon Linux was not straightforward because of its outdated default packages. For example, I had trouble using setuid in the Upstart config because the Amazon Linux AMI ships with Upstart 0.6.5.
AMI Version: amzn-ami-hvm-2016.09.1.20161221-x86_64-gp2 (ami-c51e3eb6)
Install gcc, python-devel, and python-setuptools
    sudo yum install gcc-c++ python-devel python-setuptools
Upgrade pip
    sudo pip install --upgrade pip
Install airflow using pip
    sudo /usr/local/bin/pip install "airflow[s3,hive,python]"
Create User and Group
    sudo groupadd airflow
    sudo useradd airflow -g airflow
    sudo passwd -d airflow
This creates a passwordless user named airflow.
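To confirm the account was created as intended, a quick check along these lines may help. It is only a sketch: `user_in_group` is a hypothetical helper name, not something Airflow or Amazon Linux provides, and it relies on the standard `id` utility.

```shell
# Hypothetical helper: succeeds when the user exists and is a member of the group.
user_in_group() {
    user="$1"
    group="$2"
    # `id` exits non-zero for an unknown user.
    id "$user" >/dev/null 2>&1 || return 1
    # `id -nG` prints the user's group names separated by spaces.
    id -nG "$user" | tr ' ' '\n' | grep -qx -- "$group"
}

if user_in_group airflow airflow; then
    echo "airflow user and group look fine"
else
    echo "airflow user or group is missing"
fi
```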
Initialize Airflow
    su airflow
    cd ~
    airflow initdb
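With default settings, `airflow initdb` run from the airflow user's session should leave a configuration file at `~/airflow/airflow.cfg`, which is the path the Upstart configs further down point to through `AIRFLOW_CONFIG`. Here is a small sanity check, assuming that default location (`airflow_home_initialized` is a made-up helper name):

```shell
# Hypothetical helper: succeeds when the directory contains an airflow.cfg file.
airflow_home_initialized() {
    [ -f "$1/airflow.cfg" ]
}

# Assumption: AIRFLOW_HOME was not overridden, so it defaults to ~/airflow.
airflow_home="${AIRFLOW_HOME:-$HOME/airflow}"
if airflow_home_initialized "$airflow_home"; then
    echo "found $airflow_home/airflow.cfg"
else
    echo "no airflow.cfg in $airflow_home; has 'airflow initdb' been run?"
fi
```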
Test run
    su airflow
    cd ~
    airflow webserver
You should now be able to view the Airflow UI on port 8080.
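To check from the instance itself that the webserver is answering, something like the following can be used. It is a rough sketch that assumes `curl` is installed and that the webserver listens on port 8080; `webserver_is_up` is a hypothetical helper name. If the page is not reachable from outside, the EC2 security group may also need to allow inbound traffic on that port.

```shell
# Hypothetical helper: succeeds when an HTTP server answers at host:port.
webserver_is_up() {
    host="$1"
    port="$2"
    # -s silences progress output, -o /dev/null discards the body,
    # --max-time bounds the whole request to 5 seconds.
    curl -s -o /dev/null --max-time 5 "http://$host:$port/"
}

if webserver_is_up localhost 8080; then
    echo "Airflow UI is responding on port 8080"
else
    echo "nothing is answering on port 8080 yet"
fi
```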
Upstart Config for Airflow Webserver
Now let's use Upstart to manage the Airflow process and respawn it when it dies. Sadly, this Amazon Linux AMI comes with Upstart 0.6.5, so the setuid and setgid stanzas do not work.
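A quick way to see which Upstart version an instance is running, and whether it is new enough for `setuid`/`setgid`, is sketched below. It assumes the two stanzas arrived in Upstart 1.4 and that `sort -V` (GNU coreutils) is available; the helper names are made up for this example.

```shell
# Pull the version number out of a line such as "initctl (upstart 0.6.5)".
parse_upstart_version() {
    sed -n 's/.*upstart \([0-9][0-9.]*\).*/\1/p'
}

# Succeeds when the given version is 1.4 or newer.
# Assumption: setuid/setgid were introduced in Upstart 1.4.
upstart_supports_setuid() {
    # sort -V orders version strings; if 1.4 sorts first, $1 is >= 1.4.
    lowest=$(printf '%s\n%s\n' "1.4" "$1" | sort -V | head -n 1)
    [ "$lowest" = "1.4" ]
}

if command -v initctl >/dev/null 2>&1; then
    installed=$(initctl --version | parse_upstart_version)
    if upstart_supports_setuid "$installed"; then
        echo "Upstart $installed: setuid/setgid should be available"
    else
        echo "Upstart $installed: fall back to 'exec su ...'"
    fi
fi
```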
airflow-webserver.conf
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     https://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.

    description "Airflow webserver daemon"

    start on runlevel [2345]
    stop on runlevel [016]

    respawn
    respawn limit 5 30

    env AIRFLOW_CONFIG=/home/airflow/airflow/airflow.cfg
    env AIRFLOW_HOME=/home/airflow/airflow/
    export AIRFLOW_CONFIG
    export AIRFLOW_HOME

    pre-start script
        echo "starting airflow-webserver..." >> /var/log/airflow-webserver.log
        echo $AIRFLOW_HOME >> /var/log/airflow-webserver.log
        echo $AIRFLOW_CONFIG >> /var/log/airflow-webserver.log
    end script

    # exec su -s /bin/sh -c 'exec "$0" "$@"' username -- /path/to/command [parameters...]
    exec su -s /bin/sh -c 'exec "$0" "$@"' airflow -- /usr/local/bin/airflow webserver >> /var/log/airflow-webserver.log

    pre-stop script
        echo "stopping airflow-webserver" >> /var/log/airflow-webserver.log
    end script
You should now see airflow-webserver in the output of initctl list.
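On this Upstart version a running job shows up in `initctl list` with a `start/running` state, for example `airflow-webserver start/running, process 1234` (the PID here is only illustrative). A small filter along these lines, with a hypothetical helper name, picks out a single job's state:

```shell
# Hypothetical helper: reads `initctl list` output on stdin and prints the
# goal/state field (e.g. "start/running") for the named job.
job_state() {
    awk -v job="$1" '$1 == job { sub(/,$/, "", $2); print $2 }'
}

if command -v initctl >/dev/null 2>&1; then
    state=$(initctl list | job_state airflow-webserver)
    echo "airflow-webserver: ${state:-not registered}"
fi
```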
Start Airflow with upstart
    sudo initctl start airflow-webserver
You can find the process ID in /home/airflow/airflow/airflow-webserver.pid
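Given that PID file, a liveness check can be sketched as follows. `pid_file_alive` is a hypothetical helper; it uses `kill -0`, which sends no signal and only tests whether the process exists and may be signalled by the current user, so run it as root or as airflow.

```shell
# Hypothetical helper: succeeds when the file holds the PID of a live process.
pid_file_alive() {
    [ -r "$1" ] || return 1
    pid=$(cat "$1")
    # kill -0 delivers no signal; it only reports whether signalling is possible.
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

if pid_file_alive /home/airflow/airflow/airflow-webserver.pid; then
    echo "webserver process is alive"
else
    echo "webserver PID file is missing or stale"
fi
```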
Upstart Config for Airflow Scheduler
airflow-scheduler.conf
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     https://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.

    description "Airflow scheduler daemon"

    start on started networking
    stop on (deconfiguring-networking or runlevel [016])

    respawn
    respawn limit 5 10

    env AIRFLOW_CONFIG=/home/airflow/airflow/airflow.cfg
    env AIRFLOW_HOME=/home/airflow/airflow/
    export AIRFLOW_CONFIG
    export AIRFLOW_HOME

    # required setting, 0 sets it to unlimited. Scheduler will restart after every X runs
    env SCHEDULER_RUNS=5
    export SCHEDULER_RUNS

    # exec su -s /bin/sh -c 'exec "$0" "$@"' username -- /path/to/command [parameters...]
    exec su -s /bin/sh -c 'exec "$0" "$@"' airflow -- /usr/local/bin/airflow scheduler -n ${SCHEDULER_RUNS} >> /var/log/airflow-scheduler.log
Start Airflow Scheduler with upstart
    sudo initctl start airflow-scheduler
This should keep Airflow Scheduler running in the background and respawn it in case of failures.