Troubleshooting — Ansible Deployments
When an Ansible deployment doesn't go as planned, this guide helps you figure out what happened, why, and how to fix it. It covers how to read Ansible's output, common errors organized by category, log file locations, and how to submit effective support tickets.
What's Covered in This Guide
How to Read Ansible Output
Connection and Authentication Errors
Package Installation Errors
Elasticsearch Errors
SSL Errors
Service Errors
General Ansible Errors
Health Check Commands
Log File Locations
Submitting a Support Ticket
How to Read Ansible Output
Understanding Task Output
As the playbook runs, you'll see tasks scroll by. Each task shows a role name, task name, and status:
TASK [elasticsearch : Start Elasticsearch] *************************************
ok: [10.0.1.40]
The format is: TASK [role : task name] followed by a status for each host.
| Status | Meaning | Color |
|---|---|---|
| ok | Task completed successfully; no changes were needed | Green |
| changed | Task completed successfully and made a change on the target | Yellow |
| skipping | Task was skipped (a condition was not met) | Cyan |
| fatal | Task failed; this is what to look for | Red |
When a task fails, Ansible prints a fatal: message with error details:
TASK [elasticsearch : Start Elasticsearch] *************************************
fatal: [10.0.1.40]: FAILED! => {"changed": false, "msg": "Unable to start service elasticsearch: ..."}
The msg field contains the specific error message. Read it carefully — it usually tells you exactly what went wrong.
Reading the PLAY RECAP
At the end of every run, Ansible prints a summary for each host:
PLAY RECAP *********************************************************************
10.0.1.10 : ok=71 changed=31 unreachable=0 failed=0 skipped=5 rescued=0 ignored=0
10.0.1.40 : ok=12 changed=8 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0
| Field | Meaning |
|---|---|
| ok | Tasks that completed successfully (this count includes the changed tasks) |
| changed | Tasks that made a change on the target host |
| unreachable | Host could not be contacted via SSH |
| failed | Tasks that failed; this is what to look for |
| skipped | Tasks that were skipped (conditional not met) |
| rescued | Tasks that failed but were recovered by a rescue block |
| ignored | Tasks that failed but were set to continue anyway (ignore_errors: true) |
A successful run has failed=0 and unreachable=0 for every host.
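The failed=0 and unreachable=0 rule can be checked mechanically on long runs. The following is a minimal sketch, assuming recap lines in the format shown above; the function name recap_problem_hosts is hypothetical and not part of Ansible or the playbook.

```shell
# recap_problem_hosts: read Ansible output on stdin and print every host whose
# PLAY RECAP line reports failed or unreachable greater than zero.
# (Hypothetical helper; assumes "host : ok=N changed=N ..." recap lines.)
recap_problem_hosts() {
  awk '
    /PLAY RECAP/ { in_recap = 1; next }   # recap lines follow this banner
    /PLAY \[/    { in_recap = 0 }         # a new play ends the recap section
    in_recap {
      host = ""; bad = 0
      for (i = 1; i <= NF; i++) {
        # The host name is the field just before the lone ":" separator
        if ($i == ":" && i > 1 && host == "") host = $(i - 1)
        # Counters look like "failed=1" or "unreachable=0"
        if ($i ~ /^(failed|unreachable)=[0-9]+$/) {
          split($i, kv, "=")
          if (kv[2] + 0 > 0) bad = 1
        }
      }
      if (bad && host != "") print host
    }
  '
}

# Usage: recap_problem_hosts < ansible.log
```

No output means every host in the recap passed; any host printed needs a closer look at its fatal: or UNREACHABLE! lines.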
Using the Log File
Ansible logs all output to ./ansible.log in the playbook directory. This log captures everything — including errors that may have scrolled past in the terminal.
Find errors quickly:
grep -n "fatal:" ansible.log
See context around a failure:
grep -B 5 -A 10 "fatal:" ansible.log
Find which task failed:
grep -B 2 "fatal:" ansible.log
This shows the TASK line above the fatal: line, telling you which role and task failed.
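When a log contains many failures, it can help to condense each one to a single line. This is a rough sketch, assuming the TASK and fatal: line formats shown earlier in this guide; failed_tasks is a hypothetical helper name.

```shell
# failed_tasks: print one "host<TAB>[role : task]" line per fatal: entry.
# Argument: path to the log file (defaults to ansible.log in the current dir).
failed_tasks() {
  awk '
    /TASK \[/ {
      # Remember the most recent task banner, trimmed to "[role : task]"
      task = $0
      sub(/^.*TASK \[/, "[", task)
      sub(/\].*$/, "]", task)
    }
    /fatal: \[/ {
      # Extract the host from "fatal: [host]: FAILED! => ..."
      host = $0
      sub(/^.*fatal: \[/, "", host)
      sub(/\].*$/, "", host)
      print host "\t" task
    }
  ' "${1:-ansible.log}"
}

# Usage: failed_tasks ansible.log
```

The output is only a summary; the msg field in the full fatal: line still carries the actual error detail.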
Connection and Authentication Errors
These errors occur when Ansible can't connect to or authenticate with a target host.
"Permission denied (publickey,password)"
What you see:
fatal: [10.0.1.10]: UNREACHABLE! => {"msg": "Failed to connect to the host via ssh: Permission denied (publickey,password)"}
Cause: The SSH username or password in your inventory is incorrect, or the target host doesn't allow password authentication.
How to fix:
Verify that ansible_user and ansible_ssh_pass in inventory/hosts.yml are correct
Test SSH access manually from the control machine:
ssh diskover@10.0.1.10
If using SSH keys, verify that ansible_ssh_private_key_file points to the correct key
On the target host, check if password authentication is allowed:
grep PasswordAuthentication /etc/ssh/sshd_config
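The grep above also prints commented-out lines, which can be misleading. As a hedged sketch, the hypothetical helper below reads only the first uncommented PasswordAuthentication line of a single file and falls back to the OpenSSH default of yes. It ignores Include files and Match blocks, so treat sudo sshd -T as the authoritative check.

```shell
# password_auth_setting: print the first uncommented PasswordAuthentication
# value in an sshd_config file, or "yes" (the OpenSSH default) if none is set.
# Simplification: Include directives and Match blocks are not evaluated.
password_auth_setting() {
  awk '
    tolower($1) == "passwordauthentication" {
      print tolower($2); found = 1; exit
    }
    END { if (!found) print "yes" }
  ' "${1:-/etc/ssh/sshd_config}"
}

# Usage on the target host: password_auth_setting /etc/ssh/sshd_config
```

A result of no means password logins are refused, so ansible_ssh_pass cannot work until the setting is changed or key-based authentication is used.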
"UNREACHABLE! ... No route to host" or "Connection timed out"
What you see:
fatal: [10.0.1.10]: UNREACHABLE! => {"msg": "Failed to connect to the host via ssh: ssh: connect to host 10.0.1.10 port 22: No route to host"}
Cause: The target host is not reachable from the control machine — the IP is wrong, the host is down, or there's a network barrier between them.
How to fix:
Verify the IP address in inventory/hosts.yml is correct
Test basic network connectivity:
ping 10.0.1.10
Test SSH port access:
nc -zv 10.0.1.10 22
Check for firewalls or network ACLs between the control machine and target
Verify the target machine is powered on and has networking configured
"to use the 'ssh' connection type with passwords, you must install the sshpass program"
What you see:
fatal: [10.0.1.10]: FAILED! => {"msg": "to use the 'ssh' connection type with passwords, you must install the sshpass program"}
Cause: sshpass is not installed on the control machine. It's required when using password-based SSH authentication (i.e., when ansible_ssh_pass is set in the inventory).
How to fix:
# RHEL/Rocky/Fedora
sudo dnf install sshpass
# macOS
brew install sshpass
# Ubuntu/Debian
sudo apt install sshpass
"Missing sudo password"
What you see:
fatal: [10.0.1.10]: FAILED! => {"msg": "Missing sudo password"}
Cause: ansible_become_pass is not set in your inventory, or the password is incorrect.
How to fix:
Add or correct ansible_become_pass under all.vars in inventory/hosts.yml
If the target user has passwordless sudo, you can remove ansible_become_pass entirely
Test sudo manually (the -t flag allocates a terminal so sudo can prompt for a password):
ssh -t diskover@10.0.1.10 "sudo whoami"
This should return root
Package Installation Errors
These errors occur when the playbook can't download or install packages.
"No package matching '...' found available"
What you see:
fatal: [10.0.1.10]: FAILED! => {"msg": "No package matching 'diskover-web-2.5.0' found available, installed or updated"}
Cause: The JFrog Artifactory repository is not configured correctly, the credentials are wrong, or the specified version doesn't exist.
How to fix:
Verify that jfrog_user and jfrog_pass in all.yml are correct
Verify that diskover_version matches a version that exists in JFrog
Check if the repo file was created on the target:
RHEL/Rocky/CentOS:
cat /etc/yum.repos.d/diskover.repo
Debian/Ubuntu:
cat /etc/apt/sources.list.d/diskover.list
"Cannot retrieve repository metadata" or "Failed to download metadata"
What you see:
fatal: [10.0.1.10]: FAILED! => {"msg": "Failed to download metadata for repo 'diskover'"}
Cause: The target host cannot reach JFrog Artifactory. This could be a DNS issue, firewall blocking outbound HTTPS, or a proxy requirement.
How to fix:
If the environment requires a proxy, use the proxy inventory and playbook (see the Running Playbooks guide)
If the environment has no internet access, switch to offline installation (see the Offline / Air-Gapped Installation guide)
Elasticsearch Errors
"Elasticsearch on first node is not reachable or not healthy"
What you see:
fatal: [10.0.1.40]: FAILED! => {"msg": "Elasticsearch on first node is not reachable or not healthy"}
Cause: Elasticsearch failed to start or didn't become healthy in time. Common reasons include insufficient memory, full disk, or a port conflict.
How to fix:
SSH into the Elasticsearch host and check the service:
systemctl status elasticsearch
Verify that es_heap_size in all.yml is appropriate:
Must not exceed half of available RAM
Must not exceed 31 GB
free -h # Check available RAM on the ES host
Check available disk space:
df -h
Check if port 9200 is already in use:
ss -tlnp | grep 9200
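The two heap rules above reduce to taking the smaller of half the RAM and 31 GB. A minimal sketch of that arithmetic follows; max_es_heap_mb is a hypothetical helper, and it assumes "available RAM" means the total RAM reported for the host.

```shell
# max_es_heap_mb: given total RAM in MB, print the largest heap size (in MB)
# that satisfies both rules: at most half of RAM and at most 31 GB.
max_es_heap_mb() {
  local total_mb=$1
  local half_mb=$(( total_mb / 2 ))
  local cap_mb=$(( 31 * 1024 ))   # 31 GB expressed in MB
  if [ "$half_mb" -lt "$cap_mb" ]; then
    echo "$half_mb"
  else
    echo "$cap_mb"
  fi
}

# Usage on the ES host (Linux): read MemTotal (in kB) from /proc/meminfo.
# total_mb=$(awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo)
# max_es_heap_mb "$total_mb"
```

For example, a host with 16384 MB of RAM yields 8192, while a host with 131072 MB is capped at 31744. Compare the result with the es_heap_size value in all.yml.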
Elasticsearch Memory Lock Failed
What you see (in Elasticsearch logs):
memory locking requested for elasticsearch process but memory is not locked
Cause: The operating system isn't allowing Elasticsearch to lock its process memory. This happens when es_memory_lock: true is set but the systemd configuration doesn't permit it.
How to fix:
Check the systemd service override for LimitMEMLOCK=infinity:
systemctl show elasticsearch | grep LimitMEMLOCK
Verify the ES heap size doesn't exceed available RAM
Check /etc/security/limits.conf for memlock settings
Re-run the playbook: the Elasticsearch role configures these settings, and a re-run usually resolves it
SSL Errors
"Certificate file not found" or "Key file not found"
What you see:
fatal: [10.0.1.10]: FAILED! => {"msg": "Certificate file not found at /path/to/diskover.crt"}
Cause: The ssl_cert_source or ssl_key_source path in all.yml doesn't exist on the Ansible control machine.
How to fix:
Verify the file paths are correct and the files exist on the control machine (not the target host)
Use absolute paths (e.g., /root/certs/diskover.crt, not ~/certs/diskover.crt)
Check file permissions: the user running Ansible must be able to read the files
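The three checks above (absolute path, file exists, file readable) can be run in one pass on the control machine before starting the playbook. This is a sketch only; check_source_file is a hypothetical helper, not something the playbook provides.

```shell
# check_source_file: succeed only if the given path is absolute, exists as a
# regular file, and is readable by the user running this shell.
check_source_file() {
  local path=$1
  case "$path" in
    /*) ;;   # absolute path: continue with the remaining checks
    *)  echo "not an absolute path: $path" >&2; return 1 ;;
  esac
  [ -f "$path" ] || { echo "file not found: $path" >&2; return 1; }
  [ -r "$path" ] || { echo "not readable by $(id -un): $path" >&2; return 1; }
  echo "ok: $path"
}

# Usage: pass the same values you set for ssl_cert_source and ssl_key_source.
# check_source_file /root/certs/diskover.crt
```

Run it as the same user that runs ansible-playbook, since readability depends on who is asking.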
Python SSL Verification Fails After Enabling SSL
Symptoms: After enabling SSL, Diskover Admin or other Python-based services fail with certificate verification errors.
Cause: The Python certifi certificate bundle on the target host hasn't been updated with your custom certificate.
How to fix:
Set ssl_force_reconfigure: true in all.yml
Re-run the playbook:
time ansible-playbook -i inventory/hosts.yml install_diskover.yml --limit web
After verifying SSL works, set ssl_force_reconfigure back to false
Service Errors
Diskover Admin Fails to Start
What you see:
$ systemctl status diskover-admin
● diskover-admin.service - Uvicorn instance to serve /diskover-admin
Active: failed
How to fix:
Check the Diskover Admin logs:
journalctl -u diskover-admin --no-pager -n 50
Common causes and solutions:
| Cause | Solution |
|---|---|
| Python dependencies are missing or broken | Reinstall: |
| Port 8000 is already in use | Check: |
| File permissions issue | Check ownership: |
RabbitMQ Plugin Error ({:badrpc})
What you see:
TASK [rabbitmq : Enable RabbitMQ management plugin] ****************************
fatal: [10.0.1.20]: FAILED! => {"msg": "...{:badrpc..."}
Cause: The RabbitMQ service is not running, or there's an Erlang version mismatch.
How to fix:
Check RabbitMQ status:
systemctl status rabbitmq-server
Verify Erlang is installed:
erl -version
Restart RabbitMQ and try again:
systemctl restart rabbitmq-server
Nginx "502 Bad Gateway"
Symptoms: The browser shows "502 Bad Gateway" when accessing a Diskover Web UI.
Cause: Nginx is running but the upstream service it's proxying to is not responding. The upstream service depends on which UI you're accessing:
Diskover Web UI (http://<web-host-ip>): proxies to PHP-FPM (the PHP 8.4 runtime that serves the diskover-web PHP application)
Diskover Admin UI (http://<web-host-ip>:8000): proxies to Uvicorn (the ASGI server running the Flask/FastAPI diskover-admin application)
Use the troubleshooting steps below based on which UI is showing the 502.
If the Diskover Web UI (PHP) is showing 502:
Check if Nginx is running:
systemctl status nginx
Check if PHP-FPM is running:
systemctl status php-fpm
Restart either service if not active:
systemctl restart nginx
systemctl restart php-fpm
Check the Nginx error logs — these will usually surface the true error (e.g. upstream timeout, permission denied on the PHP-FPM socket, or socket path mismatch):
tail -50 /var/log/nginx/error.log
Check the Nginx configuration:
cat /etc/nginx/conf.d/diskover-web.conf
Verify PHP-FPM is listening on its socket:
ss -xlnp | grep php
If the Nginx logs didn't give you a clear answer, check the PHP-FPM error logs as a last resort (these aren't always helpful):
tail -50 /var/opt/remi/php84/log/php-fpm/error.log
If the Diskover Admin UI is showing 502:
Check if Diskover Admin is running:
systemctl status diskover-admin
Restart if needed:
systemctl restart diskover-admin
Verify the upstream service is listening on port 8000:
ss -tlnp | grep 8000
(If SSL is enabled, also check port 443)
Check Nginx error logs:
tail -50 /var/log/nginx/error.log
General Ansible Errors
"No hosts matched" or "Could not match supplied host pattern"
What you see:
[WARNING]: Could not match supplied host pattern, ignoring: web
Cause: The group name in your --limit flag doesn't match any group in your inventory, or the inventory structure is incorrect.
How to fix:
Check your inventory group names:
ansible-inventory -i inventory/hosts.yml --list
Verify the --limit value matches a group name exactly (e.g., web, not Web; group names are case-sensitive)
Make sure your inventory follows the correct nesting:
all → children → diskover → children → web/rabbitmq/worker/elasticsearch
Task Hangs or Times Out
Cause: A task is waiting for a response that never comes — usually a package download on a slow or blocked network connection.
How to fix:
Increase the connection timeout in ansible.cfg:
timeout = 120
If behind a proxy, use the proxy configuration (see the Running Playbooks guide)
"Ansible version X is not supported" or Unexpected Module Behavior
Cause: You're running an Ansible version newer than 2.16.x. Some modules used in the playbook have breaking changes in 2.17+.
How to fix:
pip3 install ansible-core==2.16.5
ansible --version # Confirm 2.16.x
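To confirm the version in a script rather than by eye, something like the following could be used. It is a sketch that assumes the first line of ansible --version looks like "ansible [core 2.16.5]"; both function names are hypothetical.

```shell
# parse_ansible_core_version: read `ansible --version` output on stdin and
# print the core version from the first line, e.g. "2.16.5".
parse_ansible_core_version() {
  sed -n '1s/.*core \([0-9][0-9.]*\).*/\1/p'
}

# is_supported_ansible_core: succeed if the version belongs to the 2.16.x
# series that this guide recommends.
is_supported_ansible_core() {
  case "$1" in
    2.16.*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Usage:
# v=$(ansible --version | parse_ansible_core_version)
# is_supported_ansible_core "$v" || echo "unsupported ansible-core: $v"
```

If nothing is printed by the parser, the output format differs from the assumed one and the version should be checked manually.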
Health Check Commands
After deployment or when diagnosing issues, use these commands to check the status of each component:
| Component | Command | Expected Result |
|---|---|---|
| Elasticsearch (HTTP) | | |
| Elasticsearch (HTTPS, security enabled) | | |
| Elasticsearch service | | |
| RabbitMQ | | Running, listeners on port 5672 |
| Diskover Web UI (HTTP) | Open | Login page loads |
| Diskover Web UI (HTTPS, SSL enabled) | Open | Login page loads with valid certificate |
| Diskover Admin (HTTP) | Open | Admin dashboard loads |
| Diskover Admin (HTTPS, SSL enabled) | Open | Admin dashboard loads with valid certificate |
| Diskoverd | | |
| Celery | | |
| Nginx | | |
| PHP-FPM | | |
Note on the Elasticsearch HTTPS check: When es_security_enabled: true was used during deployment, the playbook generates the elastic user password and saves it to /root/.config/elastic.txt on the first Elasticsearch node. Retrieve the password from there (or from your corporate password manager if you've already scrubbed the file per the initial-install security recommendation). The -k flag in the curl command skips TLS certificate verification, which is useful when Elasticsearch is using a self-signed certificate. Omit it if you have a properly trusted CA certificate in place.
Log File Locations
When troubleshooting, these are the log files and journal commands to check for each component. The File Location column is where the component writes its own log files on disk, while the systemd Journal column shows the journalctl command to view startup and service logs.
| Component | File Location | systemd Journal |
|---|---|---|
| Ansible | | n/a |
| Elasticsearch | | |
| Diskoverd | | |
| Celery | | |
| Nginx | | |
| PHP-FPM | | |
| Diskover Admin | | |
| RabbitMQ | | |
| Kibana | | |
Tip: For any journalctl command, add -f to follow the log in real time (great for watching a service start up), or -n 100 to view the last 100 lines. Example: journalctl -fu diskoverd follows diskoverd logs as they're written.
Submitting a Support Ticket
If you can't resolve an issue using this guide, submit a ticket to Diskover Support.
Support portal: https://support.diskoverdata.com
Knowledge base: https://support.diskoverdata.com/hc/en-us
What to Include
When submitting a ticket, include as much of the following as possible to help the support team diagnose the issue quickly:
The ansible.log file: located in the playbook directory on your control machine. This is the single most useful artifact for diagnosing issues
The PLAY RECAP output: shows which hosts succeeded and which failed
Target machine OS and version:
cat /etc/os-release
Ansible version:
ansible --version
A description of what you were doing: was this a first install, upgrade, re-run, or targeted deployment with --limit?
The specific error message: copy the fatal: line(s) from the output
Any relevant service logs: if a service failed to start, include the output of journalctl -u <service> --no-pager -n 50
The more context you provide upfront, the faster the support team can help you resolve the issue.
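Several of the items above can be gathered into a single archive on the control machine. The sketch below is one possible approach, not an official tool: collect_support_bundle and the archive name diskover-support.tar.gz are hypothetical. It does not capture the target machine's OS details or service logs, which still need to be collected on the target.

```shell
# collect_support_bundle: gather control-machine artifacts into one tarball
# and print the archive path.
# Argument: playbook directory (defaults to the current directory).
collect_support_bundle() {
  local playbook_dir=${1:-.}
  local work
  work=$(mktemp -d) || return 1
  mkdir "$work/bundle" || return 1

  # The Ansible log from the playbook directory (required)
  cp "$playbook_dir/ansible.log" "$work/bundle/" || return 1
  # Every fatal: line with its task context; no matches is not an error
  grep -B 2 "fatal:" "$playbook_dir/ansible.log" > "$work/bundle/fatal-lines.txt" || true
  # Ansible version, only if ansible is on PATH
  if command -v ansible >/dev/null 2>&1; then
    ansible --version > "$work/bundle/ansible-version.txt" 2>&1 || true
  fi

  tar -czf "$work/diskover-support.tar.gz" -C "$work" bundle || return 1
  echo "$work/diskover-support.tar.gz"
}

# Usage: collect_support_bundle /path/to/playbook/directory
```

Review the archive before attaching it to a ticket, since logs can contain host names and other environment details.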