Troubleshooting — Ansible Deployments

When an Ansible deployment doesn't go as planned, this guide helps you figure out what happened, why, and how to fix it. It covers how to read Ansible's output, common errors organized by category, log file locations, and how to submit effective support tickets.

What's Covered in This Guide

How to Read Ansible Output
Connection and Authentication Errors
Package Installation Errors
Elasticsearch Errors
SSL Errors
Service Errors
General Ansible Errors
Health Check Commands
Log File Locations
Submitting a Support Ticket

How to Read Ansible Output

Understanding Task Output

As the playbook runs, you'll see tasks scroll by. Each task shows a role name, task name, and status:

TASK [elasticsearch : Start Elasticsearch] *************************************
ok: [10.0.1.40]

The format is: TASK [role : task name] followed by a status for each host.

Status	Meaning	Color
ok	Task completed successfully; no changes were needed	Green
changed	Task completed successfully and made a change on the target	Yellow
skipping	Task was skipped (a condition was not met)	Cyan
fatal	Task failed — this is what to look for	Red

When a task fails, Ansible prints a fatal: message with error details:

TASK [elasticsearch : Start Elasticsearch] *************************************
fatal: [10.0.1.40]: FAILED! => {"changed": false, "msg": "Unable to start service elasticsearch: ..."}

The msg field contains the specific error message. Read it carefully — it usually tells you exactly what went wrong.

Reading the PLAY RECAP

At the end of every run, Ansible prints a summary for each host:

PLAY RECAP *********************************************************************
10.0.1.10   : ok=71   changed=31   unreachable=0    failed=0    skipped=5    rescued=0    ignored=0
10.0.1.40   : ok=12   changed=8    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0

Field	Meaning
ok	Tasks that completed successfully (no change needed)
changed	Tasks that made a change on the target host
unreachable	Host could not be contacted via SSH
failed	Tasks that failed — this is what to look for
skipped	Tasks that were skipped (conditional not met)
rescued	Tasks that failed but were recovered by a rescue block
ignored	Tasks that failed but were set to `ignore_errors: true`

A successful run has failed=0 and unreachable=0 for every host.
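
On a deployment with many hosts, the rule above can be checked mechanically. The following is a minimal sketch, not part of the playbook: `recap_failures` is a hypothetical helper name, and it assumes recap lines keep the `host : ok=... failed=...` layout shown above.

```shell
# Print every host whose PLAY RECAP line reports failed or unreachable tasks.
# Usage: recap_failures <file containing Ansible output>
recap_failures() {
    awk '{
        host = ""; bad = 0
        for (i = 2; i <= NF; i++) {
            # The host name is the field just before the standalone ":".
            if ($i == ":" && host == "") host = $(i - 1)
            # Any non-zero failed= or unreachable= counter marks the host.
            if ($i ~ /^(failed|unreachable)=[1-9][0-9]*$/) bad = 1
        }
        if (bad && host != "") print host
    }' "$1"
}
```

Running `recap_failures ansible.log` prints one host per line. No output means no recap line in the file reported a failure or an unreachable host.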

Using the Log File

Ansible logs all output to ./ansible.log in the playbook directory. This log captures everything — including errors that may have scrolled past in the terminal.

Find errors quickly:

grep -n "fatal:" ansible.log

See context around a failure:

grep -B 5 -A 10 "fatal:" ansible.log

Find which task failed:

grep -B 2 "fatal:" ansible.log

This shows the TASK line above the fatal: line, telling you which role and task failed.
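
When a play targets several hosts, the TASK line can sit more than two lines above the fatal: line, so a fixed `-B 2` window may miss it. As a sketch under that assumption, the hypothetical helper below remembers the most recent TASK line and prints it together with each failure.

```shell
# Print each fatal: line preceded by the TASK line it belongs to.
# Usage: failed_tasks <file containing Ansible output>
failed_tasks() {
    awk '
        /TASK \[/ { task = $0 }            # remember the latest task header
        /fatal:/  { print task; print $0 } # emit header plus the failure
    ' "$1"
}
```

For example, `failed_tasks ansible.log` prints pairs of lines: the role and task name first, then the error details.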

Connection and Authentication Errors

These errors occur when Ansible can't connect to or authenticate with a target host.

"Permission denied (publickey,password)"

What you see:

fatal: [10.0.1.10]: UNREACHABLE! => {"msg": "Failed to connect to the host via ssh: Permission denied (publickey,password)"}

Cause: The SSH username or password in your inventory is incorrect, or the target host doesn't allow password authentication.

How to fix:

Verify ansible_user and ansible_ssh_pass in inventory/hosts.yml are correct
Test SSH access manually from the control machine:
```
ssh diskover@10.0.1.10
```
If using SSH keys, verify ansible_ssh_private_key_file points to the correct key
On the target host, check if password authentication is allowed:
```
grep PasswordAuthentication /etc/ssh/sshd_config
```
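
A manual SSH login proves the account works, but not that Ansible is reading the same credentials from the inventory. One way to test that, sketched below, is Ansible's built-in `ping` module, which connects using the inventory's settings. `check_inventory_access` is a hypothetical wrapper name; the default inventory path is the one used throughout this guide.

```shell
# Run Ansible's ping module against a host pattern using the inventory's credentials.
# Usage: check_inventory_access [inventory] [pattern]
check_inventory_access() {
    inventory="${1:-inventory/hosts.yml}"
    pattern="${2:-all}"
    ansible -i "$inventory" "$pattern" -m ping
}
```

Calling `check_inventory_access inventory/hosts.yml web` limits the test to the web group. A host that still reports UNREACHABLE here has a credential or connection problem in the inventory rather than in the playbook.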

"UNREACHABLE! ... No route to host" or "Connection timed out"

What you see:

fatal: [10.0.1.10]: UNREACHABLE! => {"msg": "Failed to connect to the host via ssh: ssh: connect to host 10.0.1.10 port 22: No route to host"}

Cause: The target host is not reachable from the control machine — the IP is wrong, the host is down, or there's a network barrier between them.

How to fix:

Verify the IP address in inventory/hosts.yml is correct
Test basic network connectivity:
```
ping 10.0.1.10
```
Test SSH port access:
```
nc -zv 10.0.1.10 22
```
Check for firewalls or network ACLs between the control machine and target
Verify the target machine is powered on and has networking configured
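
If several hosts are unreachable, the port test above can be looped. This is a sketch only: `check_ssh_port` is a hypothetical helper, it assumes SSH listens on port 22 as in the error message, and it relies on the `nc` options `-z` (scan without sending data) and `-w` (timeout in seconds).

```shell
# Report whether TCP port 22 answers on each host given as an argument.
# Usage: check_ssh_port <host> [<host> ...]
check_ssh_port() {
    for host in "$@"; do
        if nc -z -w 5 "$host" 22 >/dev/null 2>&1; then
            echo "$host: port 22 reachable"
        else
            echo "$host: port 22 NOT reachable"
        fi
    done
}
```

For example, `check_ssh_port 10.0.1.10 10.0.1.40` prints one line per host, which makes it easy to see whether the problem affects a single machine or a whole subnet.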

"to use the 'ssh' connection type with passwords, you must install the sshpass program"

What you see:

fatal: [10.0.1.10]: FAILED! => {"msg": "to use the 'ssh' connection type with passwords, you must install the sshpass program"}

Cause: sshpass is not installed on the control machine. It's required when using password-based SSH authentication (i.e., when ansible_ssh_pass is set in the inventory).

How to fix:

# RHEL/Rocky/Fedora
sudo dnf install sshpass

# macOS
brew install sshpass

# Ubuntu/Debian
sudo apt install sshpass

"Missing sudo password"

What you see:

fatal: [10.0.1.10]: FAILED! => {"msg": "Missing sudo password"}

Cause: ansible_become_pass is not set in your inventory. If the password is set but wrong, Ansible typically reports "Incorrect sudo password" instead; the fix is the same.

How to fix:

Add or correct ansible_become_pass under all.vars in inventory/hosts.yml
If the target user has passwordless sudo, you can remove ansible_become_pass entirely
Test sudo manually:
```
ssh diskover@10.0.1.10 "sudo whoami"
```
This should return root

Package Installation Errors

These errors occur when the playbook can't download or install packages.

"No package matching '...' found available"

What you see:

fatal: [10.0.1.10]: FAILED! => {"msg": "No package matching 'diskover-web-2.5.0' found available, installed or updated"}

Cause: The JFrog Artifactory repository is not configured correctly, the credentials are wrong, or the specified version doesn't exist.

How to fix:

Verify jfrog_user and jfrog_pass in all.yml are correct
Verify diskover_version matches a version that exists in JFrog
Check if the repo file was created on the target:
RHEL/Rocky/CentOS:
```
cat /etc/yum.repos.d/diskover.repo
```
Debian/Ubuntu:
```
cat /etc/apt/sources.list.d/diskover.list
```
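
Once the repo file is in place, you can ask the package manager which versions it actually sees. The sketch below assumes the package is named `diskover-web`, as in the error message above; `list_package_versions` is a hypothetical helper that picks `dnf` or `apt-cache` depending on what the target host has installed.

```shell
# List every version of a package that the configured repositories offer.
# Usage: list_package_versions [package]
list_package_versions() {
    pkg="${1:-diskover-web}"
    if command -v dnf >/dev/null 2>&1; then
        dnf list --showduplicates "$pkg"
    elif command -v apt-cache >/dev/null 2>&1; then
        apt-cache madison "$pkg"
    else
        echo "neither dnf nor apt-cache found" >&2
        return 1
    fi
}
```

Run it on the target host. If the version set in diskover_version is missing from the list, the repository is reachable but does not carry that version; if the command fails outright, look at the repository URL and credentials.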

"Cannot retrieve repository metadata" or "Failed to download metadata"

What you see:

fatal: [10.0.1.10]: FAILED! => {"msg": "Failed to download metadata for repo 'diskover'"}

Cause: The target host cannot reach JFrog Artifactory. This could be a DNS issue, firewall blocking outbound HTTPS, or a proxy requirement.

How to fix:

If the environment requires a proxy, use the proxy inventory and playbook (see the Running Playbooks guide)
If the environment has no internet access, switch to offline installation (see the Offline / Air-Gapped Installation guide)

Elasticsearch Errors

"Elasticsearch on first node is not reachable or not healthy"

What you see:

fatal: [10.0.1.40]: FAILED! => {"msg": "Elasticsearch on first node is not reachable or not healthy"}

Cause: Elasticsearch failed to start or didn't become healthy in time. Common reasons include insufficient memory, full disk, or a port conflict.

How to fix:

SSH into the Elasticsearch host and check the service:
```
systemctl status elasticsearch
```
Verify es_heap_size in all.yml is appropriate:
- Must not exceed half of available RAM
- Must not exceed 31 GB
```
free -h   # Check available RAM on the ES host
```
Check available disk space:
```
df -h
```
Check if port 9200 is already in use:
```
ss -tlnp | grep 9200
```
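
The two heap rules above reduce to simple arithmetic. As a sketch, the hypothetical helper below takes total RAM in kB and prints the largest heap, in whole GB, that satisfies both limits. It assumes you measure against the MemTotal value from /proc/meminfo, and it rounds down.

```shell
# Print the largest heap (in whole GB) allowed by the two rules above:
# at most half of total RAM, and never more than 31 GB.
# Argument: total RAM in kB, as reported by MemTotal in /proc/meminfo.
max_heap_gb() {
    total_kb="$1"
    half_gb=$(( total_kb / 2 / 1024 / 1024 ))
    if [ "$half_gb" -gt 31 ]; then
        half_gb=31
    fi
    echo "$half_gb"
}
```

On the Elasticsearch host, `max_heap_gb "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)"` gives a ceiling to compare against es_heap_size.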

Elasticsearch Memory Lock Failed

What you see (in Elasticsearch logs):

memory locking requested for elasticsearch process but memory is not locked

Cause: The operating system isn't allowing Elasticsearch to lock its process memory. This happens when es_memory_lock: true is set but the systemd configuration doesn't permit memory locking.

How to fix:

Check the systemd service override for LimitMEMLOCK=infinity:
```
systemctl show elasticsearch | grep LimitMEMLOCK
```
Verify the ES heap size doesn't exceed available RAM
Check /etc/security/limits.conf for memlock settings
Re-run the playbook — the Elasticsearch role configures these settings, and a re-run usually resolves it

SSL Errors

"Certificate file not found" or "Key file not found"

What you see:

fatal: [10.0.1.10]: FAILED! => {"msg": "Certificate file not found at /path/to/diskover.crt"}

Cause: The ssl_cert_source or ssl_key_source path in all.yml doesn't exist on the Ansible control machine.

How to fix:

Verify the file paths are correct and the files exist on the control machine (not the target host)
Use absolute paths (e.g., /root/certs/diskover.crt, not ~/certs/diskover.crt)
Check file permissions — the user running Ansible must be able to read the files
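
The three checks above can be run in one pass on the control machine. This is a minimal sketch; `check_ssl_source` is a hypothetical helper, and you pass it the same values you set for ssl_cert_source and ssl_key_source.

```shell
# Check one certificate or key path against the three points above.
# Prints the first problem found and returns non-zero; prints "ok" otherwise.
check_ssl_source() {
    path="$1"
    case "$path" in
        /*) ;;  # absolute path, continue
        *) echo "$path: not an absolute path"; return 1 ;;
    esac
    if [ ! -f "$path" ]; then
        echo "$path: file not found"; return 1
    fi
    if [ ! -r "$path" ]; then
        echo "$path: not readable by $(id -un)"; return 1
    fi
    echo "$path: ok"
}
```

Run it as the same user that runs ansible-playbook, for example `check_ssl_source /root/certs/diskover.crt`, so the readability check reflects what Ansible will see.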

Python SSL Verification Fails After Enabling SSL

Symptoms: After enabling SSL, Diskover Admin or other Python-based services fail with certificate verification errors.

Cause: The Python certifi certificate bundle on the target host hasn't been updated with your custom certificate.

How to fix:

Set ssl_force_reconfigure: true in all.yml

Re-run the playbook:

time ansible-playbook -i inventory/hosts.yml install_diskover.yml --limit web

After verifying SSL works, set ssl_force_reconfigure back to false

Service Errors

Diskover Admin Fails to Start

What you see:

$ systemctl status diskover-admin
● diskover-admin.service - Uvicorn instance to serve /diskover-admin
   Active: failed

How to fix:

Check the Diskover Admin logs:

journalctl -u diskover-admin --no-pager -n 50

Common causes and solutions:

Cause	Solution
Python dependencies are missing or broken	Reinstall: `python3 -m pip install -r /var/www/diskover-admin/etc/requirements.txt`
Port 8000 is already in use	Check: `ss -tlnp | grep 8000` — stop the conflicting process
File permissions issue	Check ownership: `ls -la /var/www/diskover-admin/`

RabbitMQ Plugin Error (`{:badrpc}`)

What you see:

TASK [rabbitmq : Enable RabbitMQ management plugin] ****************************
fatal: [10.0.1.20]: FAILED! => {"msg": "...{:badrpc..."}

Cause: The RabbitMQ service is not running, or there's an Erlang version mismatch.

How to fix:

Check RabbitMQ status:
```
systemctl status rabbitmq-server
```
Verify Erlang is installed:
```
erl -version
```
Restart RabbitMQ and try again:
```
systemctl restart rabbitmq-server
```

Nginx "502 Bad Gateway"

Symptoms: The browser shows "502 Bad Gateway" when accessing a Diskover Web UI.

Cause: Nginx is running but the upstream service it's proxying to is not responding. The upstream service depends on which UI you're accessing:

Diskover Web UI (http://<web-host-ip>) — proxies to PHP-FPM (the PHP 8.4 runtime that serves the diskover-web PHP application)
Diskover Admin UI (http://<web-host-ip>:8000) — proxies to Uvicorn (the ASGI server running the Flask/FastAPI diskover-admin application)

Use the troubleshooting steps below based on which UI is showing the 502.

If the Diskover Web UI (PHP) is showing 502:

Check if Nginx is running:
```
systemctl status nginx
```
Check if PHP-FPM is running:
```
systemctl status php-fpm
```

Restart either service if not active:

systemctl restart nginx
systemctl restart php-fpm

Check the Nginx error logs — these will usually surface the true error (e.g. upstream timeout, permission denied on the PHP-FPM socket, or socket path mismatch):
```
tail -50 /var/log/nginx/error.log
```
Check the Nginx configuration:
```
cat /etc/nginx/conf.d/diskover-web.conf
```
Verify PHP-FPM is listening on its socket:
```
ss -xlnp | grep php
```
If the Nginx logs didn't give you a clear answer, check the PHP-FPM error logs as a last resort (these aren't always helpful):
```
tail -50 /var/opt/remi/php84/log/php-fpm/error.log
```

If the Diskover Admin UI is showing 502:

Check if Diskover Admin is running:
```
systemctl status diskover-admin
```
Restart if needed:
```
systemctl restart diskover-admin
```
Verify the upstream service is listening on port 8000:
```
ss -tlnp | grep 8000
```
(If SSL is enabled, also check port 443)
Check Nginx error logs:
```
tail -50 /var/log/nginx/error.log
```
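
Because a 502 can come from any link in the chain, it can help to see the state of every involved service at once. The sketch below uses `systemctl is-active`, which prints a unit's current state; `service_states` is a hypothetical helper, and the unit names are the ones used in this guide.

```shell
# Print "unit: state" for each systemd unit given as an argument.
# Usage: service_states <unit> [<unit> ...]
service_states() {
    for unit in "$@"; do
        # is-active exits non-zero for inactive units, so ignore its status.
        state=$(systemctl is-active "$unit" 2>/dev/null) || true
        echo "$unit: ${state:-unknown}"
    done
}
```

For example, `service_states nginx php-fpm diskover-admin` shows which of the three needs attention before you start reading logs.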

General Ansible Errors

"No hosts matched" or "Could not match supplied host pattern"

What you see:

[WARNING]: Could not match supplied host pattern, ignoring: web

Cause: The group name in your --limit flag doesn't match any group in your inventory, or the inventory structure is incorrect.

How to fix:

Check your inventory group names:

ansible-inventory -i inventory/hosts.yml --list

Verify the --limit value matches a group name exactly (e.g., web, not Web — group names are case-sensitive)

Make sure your inventory follows the correct nesting:

all → children → diskover → children → web/rabbitmq/worker/elasticsearch
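
The nesting is easier to verify as a tree than as JSON. `ansible-inventory --graph` prints each group as `@name:`, so a case-sensitive search of that output tells you whether a `--limit` value will match. The helper below is a sketch; `has_group` is a hypothetical name.

```shell
# Succeed if the group named $1 appears in `ansible-inventory --graph` output on stdin.
# Groups are printed as "@name:" in the graph; the match is case-sensitive.
has_group() {
    grep -qF -- "@$1:"
}
```

For example, `ansible-inventory -i inventory/hosts.yml --graph | has_group web && echo "web group found"` prints nothing if the group is missing or spelled with different capitalization.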

Task Hangs or Times Out

Cause: A task is waiting for a response that never comes — usually a package download on a slow or blocked network connection.

How to fix:

Increase the connection timeout in ansible.cfg:
```
[defaults]
timeout = 120
```
If behind a proxy, use the proxy configuration (see the Running Playbooks guide)

"Ansible version X is not supported" or Unexpected Module Behavior

Cause: You're running an ansible-core version newer than 2.16.x. Some modules used in the playbook have breaking changes in 2.17+.

How to fix:

pip3 install ansible-core==2.16.5
ansible --version   # Confirm 2.16.x
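
If you want to guard a wrapper script against the wrong version, the check can be automated. This sketch assumes the first line of `ansible --version` has the form `ansible [core 2.16.5]`; `is_core_2_16` is a hypothetical helper that only tests for the 2.16 series.

```shell
# Succeed if the given version line names an ansible-core 2.16.x release.
# Argument: the first line of `ansible --version`.
is_core_2_16() {
    case "$1" in
        *"core 2.16."*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Usage: `is_core_2_16 "$(ansible --version | head -n 1)" || echo "ansible-core is not 2.16.x"`.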

Health Check Commands

After deployment or when diagnosing issues, use these commands to check the status of each component:

Component	Command	Expected Result
Elasticsearch (HTTP)	`curl -s http://localhost:9200/_cluster/health`	`"status": "green"` or `"yellow"`
Elasticsearch (HTTPS, security enabled)	`curl -sk -u elastic:<password> https://localhost:9200/_cluster/health`	`"status": "green"` or `"yellow"`
Elasticsearch service	`systemctl status elasticsearch`	`active (running)`
RabbitMQ	`rabbitmqctl status`	Running, listeners on port 5672
Diskover Web UI (HTTP)	Open `http://<web-host-ip>:8000/login.php` in a browser	Login page loads
Diskover Web UI (HTTPS, SSL enabled)	Open `https://<ssl_domain>/login.php` in a browser	Login page loads with valid certificate
Diskover Admin (HTTP)	Open `http://<web-host-ip>:8000/diskover_admin/config/` in a browser	Admin dashboard loads
Diskover Admin (HTTPS, SSL enabled)	Open `https://<ssl_domain>/diskover_admin/config/` in a browser	Admin dashboard loads with valid certificate
Diskoverd	`systemctl status diskoverd`	`active (running)`
Celery	`systemctl status celery`	`active (running)`
Nginx	`systemctl status nginx`	`active (running)`
PHP-FPM	`systemctl status php-fpm`	`active (running)`

Note on the Elasticsearch HTTPS check: When es_security_enabled: true is set during deployment, the playbook generates the elastic user password and saves it to /root/.config/elastic.txt on the first Elasticsearch node. Retrieve the password from there (or from your corporate password manager if you've already scrubbed the file per the initial-install security recommendation). The -k flag in the curl command skips TLS certificate verification, which is useful when Elasticsearch is using a self-signed certificate. Omit it if the certificate is signed by a CA that the host already trusts.
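
The cluster health response is JSON, and for a quick check only the status field matters. As a sketch, the hypothetical helper below extracts that field with sed so the result can be used in a script; it assumes the response contains a `"status"` key as shown in the table above.

```shell
# Read a _cluster/health JSON response on stdin and print its status value.
es_status() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}
```

For example, `curl -s http://localhost:9200/_cluster/health | es_status` prints a single word. Empty output means the request failed or returned something other than a health response.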

Log File Locations

When troubleshooting, these are the log files and journal commands to check for each component. The File Location column is where the component writes its own log files on disk, while the systemd Journal column shows the journalctl command to view startup and service logs.

Component	File Location	systemd Journal
Ansible	`./ansible.log` (in the playbook directory on the control machine)	—
Elasticsearch	`/var/log/elasticsearch/` (or the path set by `es_log_dir`)	`journalctl -u elasticsearch`
Diskoverd	`/var/log/diskover/`	`journalctl -u diskoverd`
Celery	`/opt/diskover/diskover_celery/log/celery.log`	`journalctl -u celery`
Nginx	`/var/log/nginx/error.log` and `/var/log/nginx/access.log`	`journalctl -u nginx`
PHP-FPM	`/var/opt/remi/php84/log/php-fpm/error.log`	`journalctl -u php-fpm`
Diskover Admin	`/var/log/diskover/`	`journalctl -u diskover-admin`
RabbitMQ	`/var/log/rabbitmq/`	`journalctl -u rabbitmq-server`
Kibana	`/var/log/kibana/`	`journalctl -u kibana`

Tip: For any journalctl command, add -f to follow the log in real time (great for watching a service start up), or -n 100 to view the last 100 lines. Example: journalctl -fu diskoverd follows diskoverd logs as they're written.

Submitting a Support Ticket

If you can't resolve an issue using this guide, submit a ticket to Diskover Support.

Support portal: https://support.diskoverdata.com

Knowledge base: https://support.diskoverdata.com/hc/en-us

What to Include

When submitting a ticket, include as much of the following as possible to help the support team diagnose the issue quickly:

The ansible.log file — Located in the playbook directory on your control machine. This is the single most useful artifact for diagnosing issues
The PLAY RECAP output — Shows which hosts succeeded and which failed
Target machine OS and version:
```
cat /etc/os-release
```
Ansible version:
```
ansible --version
```
A description of what you were doing — Was this a first install, upgrade, re-run, or targeted deployment with --limit?
The specific error message — Copy the fatal: line(s) from the output
Any relevant service logs — If a service failed to start, include the output of journalctl -u <service> --no-pager -n 50

The more context you provide upfront, the faster the support team can help you resolve the issue.
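
The control-machine items on this list can be gathered into a single archive. The following is a sketch, not an official support tool: `collect_support_bundle`, the archive name, and the file names inside it are all hypothetical. It does not collect anything from the target hosts, so add /etc/os-release and service logs separately.

```shell
# Collect control-machine artifacts for a support ticket into one archive.
# Usage: collect_support_bundle [playbook_dir] [output.tar.gz]
collect_support_bundle() {
    playbook_dir="${1:-.}"
    out="${2:-support-bundle.tar.gz}"
    staging=$(mktemp -d) || return 1

    # The log is the essential item; stop if it is missing.
    cp "$playbook_dir/ansible.log" "$staging/" || { rm -rf "$staging"; return 1; }

    # Record the Ansible version if the command is available.
    if command -v ansible >/dev/null 2>&1; then
        ansible --version > "$staging/ansible-version.txt" 2>&1 || true
    fi

    # Keep the fatal: lines in a separate file for quick reading.
    grep -n "fatal:" "$staging/ansible.log" > "$staging/fatal-lines.txt" || true

    tar -czf "$out" -C "$staging" . && rm -rf "$staging"
}
```

Run `collect_support_bundle` from the playbook directory and attach the resulting archive to the ticket along with your description of what you were doing.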
