Airflow
Airflow Integration.
Version | 0.7.0 |
Compatible Kibana version(s) | 8.11.0 or higher |
Supported Serverless project types | Security, Observability |
Subscription level | Basic |
Level of support | Elastic |
Overview
Airflow is an open-source platform for programmatically authoring, scheduling, and monitoring workflows. It allows users to define workflows as Directed Acyclic Graphs (DAGs) of tasks, which are then executed by the Airflow scheduler on an array of workers while following the specified dependencies.
Use the Airflow integration to:
- Collect detailed metrics from Airflow using StatsD to gain insights into system performance.
- Create informative visualizations to track usage trends, measure key metrics, and derive actionable business insights.
- Monitor your workflows' performance and status in real-time.
Data streams
The Airflow integration gathers metric data.
Metrics provide insight into the statistics of Airflow. The metric data stream collected by the Airflow integration is statsd, enabling users to monitor and troubleshoot the performance of the Airflow instance.
Data stream:
- statsd: Collects metrics related to scheduler activities, pool usage, task execution details, executor performance, and worker states in Airflow.
Note:
- Users can monitor and view metrics within the ingested documents for Airflow in the metrics-* index pattern from Discover.
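As a rough sketch, the same documents can also be selected programmatically rather than through Discover. The snippet below only builds a search request body as a plain Python dictionary. The helper name `build_airflow_query`, the 15-minute window, and the result size are assumptions for illustration; the `data_stream.dataset` field and its `airflow.statsd` value are taken from the example event in this document.

```python
import json


def build_airflow_query(minutes: int = 15, size: int = 10) -> dict:
    """Build a search body for recent Airflow StatsD documents (hypothetical helper)."""
    return {
        "size": size,
        # Newest documents first.
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                "filter": [
                    # Dataset name as it appears in the example event.
                    {"term": {"data_stream.dataset": "airflow.statsd"}},
                    # Assumed time window; adjust as needed.
                    {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}},
                ]
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_airflow_query(), indent=2))
```

The resulting body could be sent to the `_search` endpoint of the metrics-* index pattern with whichever Elasticsearch client is already in use.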
Compatibility
The Airflow module is tested with Airflow 2.4.0. It should work with versions 2.0.0 and later.
Prerequisites
Users require Elasticsearch to store and search user data, and Kibana to visualize and manage it. They can utilize the hosted Elasticsearch Service on Elastic Cloud, which is recommended, or self-manage the Elastic Stack on their own hardware.
To ingest data from Airflow, users must have StatsD available to receive the metrics that Airflow emits.
Setup
For step-by-step instructions on how to set up an integration, see the Getting started guide.
Steps to set up Airflow
Be sure to follow the official Airflow Installation Guide for the correct installation of Airflow.
Include the following lines in the user's Airflow configuration file (e.g. airflow.cfg). Leave statsd_prefix empty and replace %HOST% with the address where the Agent is running:
[metrics]
statsd_on = True
statsd_host = %HOST%
statsd_port = 8125
statsd_prefix =
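The settings above can be sanity-checked before restarting Airflow. The following is a minimal sketch, not part of the integration: the function name `check_metrics_section` and the sample host `127.0.0.1` are hypothetical, and the check assumes the port 8125 and the empty prefix shown in the configuration above.

```python
import configparser

# Sample configuration mirroring the lines above, with an assumed host address.
SAMPLE_CFG = """
[metrics]
statsd_on = True
statsd_host = 127.0.0.1
statsd_port = 8125
statsd_prefix =
"""


def check_metrics_section(cfg_text: str) -> list[str]:
    """Return a list of problems found in the [metrics] section (empty if none)."""
    # Interpolation is disabled so a literal %HOST% does not raise an error.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    if not parser.has_section("metrics"):
        return ["missing [metrics] section"]
    problems = []
    section = parser["metrics"]
    if not section.getboolean("statsd_on", fallback=False):
        problems.append("statsd_on is not True")
    host = section.get("statsd_host", fallback="").strip()
    if not host or host == "%HOST%":
        problems.append("statsd_host is unset or still the %HOST% placeholder")
    if section.get("statsd_port", fallback="").strip() != "8125":
        problems.append("statsd_port is not 8125")
    if section.get("statsd_prefix", fallback="").strip():
        problems.append("statsd_prefix should be left empty")
    return problems


if __name__ == "__main__":
    print(check_metrics_section(SAMPLE_CFG) or "configuration looks consistent")
```

To check a real file, its contents could be read and passed in as `cfg_text`; the path to airflow.cfg depends on the installation.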
Validation
Once the integration is set up, you can click on the Assets tab in the Airflow integration to see a list of available dashboards. Choose the dashboard that corresponds to your configured data stream. The dashboard should be populated with the required data.
Troubleshooting
- Check if the StatsD server is receiving data from Airflow by examining the logs for potential errors.
- Make sure the %HOST% placeholder in the Airflow configuration file is replaced with the correct address of the machine where the StatsD server is running.
- If Airflow metrics are not being emitted, confirm that the [metrics] section in the airflow.cfg file is properly configured as per the instructions above.
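When it is unclear whether the problem lies with Airflow or with the network path, one option is to send a single hand-made StatsD datagram to the configured address. The sketch below assumes the StatsD server listens on UDP port 8125, as in the configuration above; the function names and the metric name `connectivity_test` are hypothetical. Because UDP is connectionless, a successful send does not prove the server received the packet, so the StatsD server logs or the ingested documents still need to be checked afterwards.

```python
import socket


def format_gauge(name: str, value: float) -> bytes:
    """Encode one gauge in the plain StatsD line format: <name>:<value>|g."""
    return f"{name}:{value}|g".encode("ascii")


def send_test_gauge(host: str, port: int = 8125, name: str = "connectivity_test") -> int:
    """Send a single gauge datagram over UDP and return the number of bytes sent."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(format_gauge(name, 1), (host, port))


if __name__ == "__main__":
    # Replace 127.0.0.1 with the address used for statsd_host.
    print(send_test_gauge("127.0.0.1"), "bytes sent")
```

Note that the test datagram may show up as an extra metric in the ingested data, so a clearly recognizable name is a sensible choice.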
Metrics reference
Statsd
This is the statsd data stream, which collects metrics related to scheduler activities, pool usage, task execution details, executor performance, and worker states in Airflow.
An example event for statsd looks as follows:
{
  "@timestamp": "2023-11-28T06:26:54.238Z",
  "agent": {
    "ephemeral_id": "283d2103-181e-4e55-990c-d463765d591a",
    "id": "208488b1-ba3d-4035-b968-4202e1fadc05",
    "name": "docker-fleet-agent",
    "type": "metricbeat",
    "version": "8.11.0"
  },
  "airflow": {
    "task_executable": {
      "value": 0
    }
  },
  "data_stream": {
    "dataset": "airflow.statsd",
    "namespace": "ep",
    "type": "metrics"
  },
  "ecs": {
    "version": "8.5.1"
  },
  "elastic_agent": {
    "id": "208488b1-ba3d-4035-b968-4202e1fadc05",
    "snapshot": false,
    "version": "8.11.0"
  },
  "event": {
    "agent_id_status": "verified",
    "dataset": "airflow.statsd",
    "ingested": "2023-11-28T06:26:55Z",
    "module": "statsd"
  },
  "host": {
    "architecture": "x86_64",
    "containerized": true,
    "hostname": "docker-fleet-agent",
    "id": "d7fd92f5e61644938d48518adcee73ad",
    "ip": "172.20.0.7",
    "mac": "02-42-AC-14-00-07",
    "name": "docker-fleet-agent",
    "os": {
      "codename": "focal",
      "family": "debian",
      "kernel": "3.10.0-1160.90.1.el7.x86_64",
      "name": "Ubuntu",
      "platform": "ubuntu",
      "type": "linux",
      "version": "20.04.6 LTS (Focal Fossa)"
    }
  },
  "metricset": {
    "name": "server"
  },
  "service": {
    "type": "statsd"
  }
}
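The metric itself sits under the nested airflow object, while the field names in the table of exported fields are written in dotted form. As an illustrative sketch (the helper name `flatten_airflow_metrics` is hypothetical, and the event is trimmed to the keys needed here), the nested object can be flattened into those dotted names like this:

```python
def flatten_airflow_metrics(event: dict) -> dict:
    """Flatten the nested "airflow" object of an event into dotted field names."""
    flat = {}

    def walk(prefix: str, node) -> None:
        # Descend into nested objects; record leaf values under the dotted path.
        if isinstance(node, dict):
            for key, value in node.items():
                walk(f"{prefix}.{key}", value)
        else:
            flat[prefix] = node

    walk("airflow", event.get("airflow", {}))
    return flat


# Trimmed copy of the example event above.
event = {
    "@timestamp": "2023-11-28T06:26:54.238Z",
    "airflow": {"task_executable": {"value": 0}},
    "data_stream": {"dataset": "airflow.statsd"},
}

print(flatten_airflow_metrics(event))
# {'airflow.task_executable.value': 0}
```

Here the flattened name ends in .value, which corresponds to the airflow.*.value (gauge) pattern in the exported fields.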
Exported fields
Field | Description | Type | Metric Type |
---|---|---|---|
@timestamp | Event timestamp. | date | |
agent.id | | keyword | |
airflow.*.count | Airflow counters | object | counter |
airflow.*.max | Airflow max timers metric | object | |
airflow.*.mean | Airflow mean timers metric | object | |
airflow.*.mean_rate | Airflow mean rate timers metric | object | |
airflow.*.median | Airflow median timers metric | object | |
airflow.*.min | Airflow min timers metric | object | |
airflow.*.stddev | Airflow standard deviation timers metric | object | |
airflow.*.value | Airflow gauges | object | gauge |
airflow.dag_file | Airflow dag file metadata | keyword | |
airflow.dag_id | Airflow dag id metadata | keyword | |
airflow.job_name | Airflow job name metadata | keyword | |
airflow.operator_name | Airflow operator name metadata | keyword | |
airflow.pool_name | Airflow pool name metadata | keyword | |
airflow.scheduler_heartbeat.count | Airflow scheduler heartbeat | double | |
airflow.status | Airflow status metadata | keyword | |
airflow.task_id | Airflow task id metadata | keyword | |
cloud.account.id | The cloud account or organization id used to identify different entities in a multi-tenant environment. Examples: AWS account id, Google Cloud ORG Id, or other unique identifier. | keyword | |
cloud.availability_zone | Availability zone in which this host is running. | keyword | |
cloud.image.id | Image ID for the cloud instance. | keyword | |
cloud.instance.id | Instance ID of the host machine. | keyword | |
cloud.instance.name | Instance name of the host machine. | keyword | |
cloud.machine.type | Machine type of the host machine. | keyword | |
cloud.project.id | Name of the project in Google Cloud. | keyword | |
cloud.provider | Name of the cloud provider. Example values are aws, azure, gcp, or digitalocean. | keyword | |
cloud.region | Region in which this host is running. | keyword | |
container.id | Unique container id. | keyword | |
container.image.name | Name of the image the container was built on. | keyword | |
container.labels | Image labels. | object | |
container.name | Container name. | keyword | |
container.runtime | Runtime managing this container. | keyword | |
data_stream.dataset | Data stream dataset. | constant_keyword | |
data_stream.namespace | Data stream namespace. | constant_keyword | |
data_stream.type | Data stream type. | constant_keyword | |
ecs.version | ECS version this event conforms to. ecs.version is a required field and must exist in all events. When querying across multiple indices -- which may conform to slightly different ECS versions -- this field lets integrations adjust to the schema version of the events. | keyword | |
event.dataset | Event dataset | constant_keyword | |
event.module | Event module | constant_keyword | |
host | A host is defined as a general computing instance. ECS host.* fields should be populated with details about the host on which the event happened, or from which the measurement was taken. Host types include hardware, virtual machines, Docker containers, and Kubernetes nodes. | group | |
host.architecture | Operating system architecture. | keyword | |
host.containerized | If the host is a container. | boolean | |
host.domain | Name of the domain of which the host is a member. For example, on Windows this could be the host's Active Directory domain or NetBIOS domain name. For Linux this could be the domain of the host's LDAP provider. | keyword | |
host.hostname | Hostname of the host. It normally contains what the hostname command returns on the host machine. | keyword | |
host.id | Unique host id. As hostname is not always unique, use values that are meaningful in your environment. Example: The current usage of beat.name. | keyword | |
host.ip | Host ip addresses. | ip | |
host.mac | Host mac addresses. | keyword | |
host.name | Name of the host. It can contain what hostname returns on Unix systems, the fully qualified domain name, or a name specified by the user. The sender decides which value to use. | keyword | |
host.os.build | OS build information. | keyword | |
host.os.codename | OS codename, if any. | keyword | |
host.os.family | OS family (such as redhat, debian, freebsd, windows). | keyword | |
host.os.kernel | Operating system kernel version as a raw string. | keyword | |
host.os.name | Operating system name, without the version. | keyword | |
host.os.name.text | Multi-field of host.os.name. | text | |
host.os.platform | Operating system platform (such as centos, ubuntu, windows). | keyword | |
host.os.version | Operating system version as a raw string. | keyword | |
host.type | Type of host. For Cloud providers this can be the machine type like t2.medium. If vm, this could be the container, for example, or other information meaningful in your environment. | keyword | |
service.address | Service address | keyword | |
service.type | The type of the service data is collected from. The type can be used to group and correlate logs and metrics from one service type. Example: If logs or metrics are collected from Elasticsearch, service.type would be elasticsearch. | keyword | |
Changelog
Version | Details | Kibana version(s) |
---|---|---|
0.7.0 | Enhancement View pull request | — |
0.6.0 | Enhancement View pull request | — |
0.5.1 | Bug fix View pull request | — |
0.5.0 | Enhancement View pull request | — |
0.4.0 | Enhancement View pull request | — |
0.3.1 | Bug fix View pull request | — |
0.3.0 | Enhancement View pull request | — |
0.2.0 | Enhancement View pull request | — |
0.1.0 | Enhancement View pull request | — |
0.0.5 | Bug fix View pull request | — |
0.0.4 | Enhancement View pull request | — |
0.0.3 | Enhancement View pull request | — |
0.0.2 | Enhancement View pull request | — |
0.0.1 | Enhancement View pull request | — |