Azure Blob Storage
License: PRO (Professional Edition or higher)
Module Type: Alternate Scanner
Author: Diskover Data, Inc.
Overview / Use Cases
The Azure Blob Storage scanner brings your cloud storage into Diskover, giving you full search and discovery capabilities across your Azure blob containers — right alongside your on-premises file systems. Instead of manually browsing the Azure Portal or writing scripts to understand what's in your storage accounts, you can index everything into Diskover and search it just like any other data source.
The scanner translates Azure's flat blob namespace into the familiar hierarchical folder structure you're used to seeing in Diskover. It captures Azure-specific metadata like storage tiers and ETags, so you can build smart workflows around cost optimization and change tracking.
Here are some common scenarios where this scanner shines:
Cloud Migration Visibility — If you're migrating data into Azure, the scanner lets you index your containers after each migration phase to verify all expected files made it to their destination. You can create periodic snapshots during long-running migrations to track progress and catch stalled transfers early.
Storage Cost Optimization — Azure offers Hot, Cool, and Archive tiers at different price points. The scanner indexes tier information so you can search for old data sitting in expensive Hot storage, identify large blobs that should be archived, and verify that your Azure lifecycle policies are actually moving data as expected.
Hybrid Storage Management — When your data lives across on-premises NAS and Azure cloud, you need a single place to search everything. Index both your local file shares and Azure containers into Diskover for unified search and consolidated storage reporting — no more switching between tools to find what you need.
Compliance and Data Governance — Use Diskover's search and tagging capabilities combined with Azure metadata to track data residency, identify sensitive data stored in cloud containers, and generate audit-ready reports of your cloud storage footprint.
Sample metadata from Azure Blob scan:
Each item indexed by an Azure Blob scan includes the blob's Tier and ETag values (these are empty for virtual directories).
Understanding Azure Blob Storage
Azure Blob Storage is Microsoft's massively scalable object storage service designed for unstructured data — documents, images, videos, backups, logs, and virtually any file type. If you're new to Azure Blob Storage or want to understand how the scanner maps to Azure concepts, this section will help.
Key Concepts
Azure Concept | Description | Diskover Equivalent |
|---|---|---|
Storage Account | Top-level Azure resource that contains all your blob data | Configured in scanner credentials |
Container | A logical grouping of blobs (similar to a root folder) | Appears as the top-level directory in the index (e.g., /mycontainer) |
Blob | An individual file stored in a container | Indexed as a file entry in Diskover |
Virtual Directory | Blob name prefixes separated by / | Indexed as directories in Diskover |
Storage Tier | Cost/performance class: Hot, Cool, or Archive | Indexed as azure_tier |
ETag | A unique identifier that changes when a blob is modified | Indexed as azure_etag |
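To make the flat-namespace idea concrete, here is a minimal Python sketch of how directory paths can be derived from blob names. It is an illustration only, not the scanner's actual code; the function name virtual_directories and the sample blob names are made up for this example.

```python
from pathlib import PurePosixPath


def virtual_directories(blob_names):
    """Return the sorted directory paths implied by '/'-separated blob names."""
    dirs = set()
    for name in blob_names:
        # Every ancestor of a blob name acts as a virtual directory.
        for parent in PurePosixPath(name).parents:
            if str(parent) != ".":
                dirs.add(str(parent))
    return sorted(dirs)


# Hypothetical blob names as they might appear in one container.
blobs = ["reports/financial/q1.xlsx", "reports/summary.txt", "readme.md"]
print(virtual_directories(blobs))
```

A blob with no / in its name (readme.md above) sits directly under the container and implies no virtual directory.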
Storage Tiers
Understanding storage tiers is important because they directly impact your costs. The scanner captures tier information so you can analyze and optimize your spending:
Tier | Best For | Access Cost | Storage Cost |
|---|---|---|---|
Hot | Frequently accessed data | Lowest | Highest |
Cool | Infrequently accessed (30+ days) | Moderate | Lower |
Archive | Rarely accessed (180+ days) | Highest (retrieval fees + latency) | Lowest |
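As a rough illustration of the table above, the 30-day and 180-day guidance can be expressed as a simple rule of thumb. This is a sketch under the assumption that access age is the only input; the function name suggest_tier is hypothetical, and real tiering decisions should also weigh retrieval fees and latency.

```python
def suggest_tier(days_since_last_access):
    """Map an access age in days to a tier using the 30/180-day guidance."""
    if days_since_last_access < 0:
        raise ValueError("days_since_last_access must be non-negative")
    if days_since_last_access >= 180:
        return "Archive"
    if days_since_last_access >= 30:
        return "Cool"
    return "Hot"


for days in (7, 45, 365):
    print(days, suggest_tier(days))
```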
Authentication Methods
The scanner supports two ways to connect to Azure. Choose the one that fits your organization's security posture:
Method | Best For | Setup Complexity | Security |
|---|---|---|---|
Connection String | Quick setup, dev/test, simple deployments | Low — copy from Azure Portal | Moderate — contains account key |
Azure AD Service Principal | Enterprise environments with centralized identity | Moderate — requires app registration | High — role-based, secret rotation supported |
Tip: For production deployments, Azure AD credentials are recommended because they let you assign the minimum required role (Storage Blob Data Reader) and support secret rotation without regenerating storage account keys.
Requirements
System Requirements
Component | Requirement |
|---|---|
Python | 3.9 or higher |
Diskover | Core installation with alternate scanner support |
Network | HTTPS access to Azure Blob Storage endpoints (*.blob.core.windows.net) |
Python Dependencies
Package | Version | Purpose |
|---|---|---|
azure-storage-blob | 12.x | Azure Blob Storage SDK for blob enumeration and property access |
azure-identity | 1.x | Azure AD authentication via service principal credentials |
Azure Prerequisites
Requirement | Description |
|---|---|
Storage Account | Azure Storage account with Blob Storage enabled |
Authentication | Connection string OR Azure AD service principal with storage access |
Container Access | Read access to blob containers — Storage Blob Data Reader role at minimum |
Network Access | Firewall / VNet rules must permit access from the Diskover scanner host |
Permissions
For connection string authentication:
Obtain the connection string from Azure Portal > Storage Account > Security + networking > Access Keys
For Azure AD credentials authentication:
Create an Azure AD (Entra ID) application registration
Create a client secret for the application
Grant the application the Storage Blob Data Reader role on the storage account
Record the Tenant ID, Client ID, and Client Secret
Installation
Step 1: Install Scanner Package
Linux:
dnf install diskover-scanner-azure
Windows:
The scanner files are included with the Diskover Windows installation. No separate installation step is required.
Install locations:
Linux:
/opt/diskover/scanners/scandir_azure/
Windows:
C:\Program Files\Diskover\scanners\scandir_azure\
Step 2: Install Python Dependencies
Install the required Azure SDK packages:
Linux:
python3 -m pip install azure-storage-blob azure-identity
Windows:
python -m pip install azure-storage-blob azure-identity
Or install the versions pinned in the scanner's requirements file (Linux path shown):
python3 -m pip install -r /opt/diskover/scanners/scandir_azure/requirements.txt
Step 3: Verify Installation
Confirm the Azure packages are installed correctly:
python3 -c "from azure.storage.blob import BlobServiceClient; print('azure-storage-blob: OK')"
python3 -c "from azure.identity import ClientSecretCredential; print('azure-identity: OK')"
Check installed versions:
python3 -m pip show azure-storage-blob azure-identity | grep -E "^(Name|Version)"
Step 4: Configure Authentication
Choose one of the two authentication methods below. Authentication can be configured via environment variables, TOML configuration files, or the Diskover Admin UI (see the Configuration section for full details).
Option A: Connection String Authentication
Linux:
export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=abc123...;EndpointSuffix=core.windows.net"
Windows:
$env:AZURE_STORAGE_CONNECTION_STRING = "DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=abc123...;EndpointSuffix=core.windows.net"
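Before exporting a connection string, it can help to confirm it is well-formed. The sketch below splits a connection string into its key/value parts and reports missing keys. The helper names are hypothetical, and the required keys (AccountName, AccountKey) are an assumption based on the example above; SAS-based connection strings use different keys.

```python
def parse_connection_string(conn_str):
    """Split 'Key=Value;Key=Value' into a dict, keeping '=' inside values."""
    parts = {}
    for segment in conn_str.split(";"):
        if not segment.strip():
            continue
        # Split on the first '=' only: base64 account keys often end in '='.
        key, sep, value = segment.partition("=")
        if not sep:
            raise ValueError(f"Malformed segment: {segment!r}")
        parts[key.strip()] = value
    return parts


def missing_keys(conn_str, required=("AccountName", "AccountKey")):
    """Return the required keys that are absent or empty."""
    parsed = parse_connection_string(conn_str)
    return [key for key in required if not parsed.get(key)]


sample = "DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=abc123==;EndpointSuffix=core.windows.net"
print(missing_keys(sample))
```

The check is purely local; it never prints the key and does not contact Azure.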
Option B: Azure AD Credentials Authentication
Linux:
export AZURE_STORAGE_BLOB_URL="https://mystorageaccount.blob.core.windows.net"
export AZURE_TENANT_ID="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
export AZURE_CLIENT_ID="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
export AZURE_CLIENT_SECRET="your-client-secret-value"
Windows:
$env:AZURE_STORAGE_BLOB_URL = "https://mystorageaccount.blob.core.windows.net"
$env:AZURE_TENANT_ID = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
$env:AZURE_CLIENT_ID = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
$env:AZURE_CLIENT_SECRET = "your-client-secret-value"
Step 5: Verify Azure Connectivity
Test connectivity to your Azure storage account:
python3 -c "
from azure.storage.blob import BlobServiceClient
import os
conn_str = os.getenv('AZURE_STORAGE_CONNECTION_STRING')
client = BlobServiceClient.from_connection_string(conn_str)
containers = list(client.list_containers())
print(f'Found {len(containers)} containers')
for c in containers[:5]:
    print(f' - {c.name}')
"
If you're using Azure AD credentials, adapt the test to use ClientSecretCredential instead.
Sample output of the Azure connectivity test:
Found 2 containers
- blob-acme-corp
- blob-storage-diskover
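If you authenticate with Azure AD credentials, the connectivity test can be adapted along the following lines. This is a sketch, not part of the scanner: the helper names are hypothetical, and it assumes the four environment variables from Option B are set. ClientSecretCredential and BlobServiceClient come from the azure-identity and azure-storage-blob packages installed in Step 2.

```python
import os

REQUIRED_VARS = (
    "AZURE_STORAGE_BLOB_URL",
    "AZURE_TENANT_ID",
    "AZURE_CLIENT_ID",
    "AZURE_CLIENT_SECRET",
)


def missing_env_vars(env):
    """Return the names of required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


def list_container_names(env, limit=5):
    """List up to `limit` container names using service principal credentials."""
    # Imported lazily so the validation helper works without the Azure SDK.
    from azure.identity import ClientSecretCredential
    from azure.storage.blob import BlobServiceClient

    credential = ClientSecretCredential(
        tenant_id=env["AZURE_TENANT_ID"],
        client_id=env["AZURE_CLIENT_ID"],
        client_secret=env["AZURE_CLIENT_SECRET"],
    )
    client = BlobServiceClient(
        account_url=env["AZURE_STORAGE_BLOB_URL"], credential=credential
    )
    names = []
    for container in client.list_containers():
        names.append(container.name)
        if len(names) >= limit:
            break
    return names
```

A typical call would first check missing_env_vars(os.environ) and only run list_container_names(os.environ) when that list is empty.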
Configuration
Configuration is managed through the Diskover Admin UI, environment variables, or TOML configuration files. The scanner loads configuration from multiple sources in this order of precedence:
Environment Variables: highest priority, overrides all other sources
TOML Configuration Files: settings.toml and .secrets.toml in the scanner directory
Diskover Admin UI: API-sourced configuration when available
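The precedence order above can be pictured as a first-match lookup. The following is a minimal sketch of that idea, not the scanner's actual loader; the function name resolve_setting and the source labels are hypothetical.

```python
def resolve_setting(name, env_value=None, toml_value=None, admin_value=None):
    """Return (value, source) for the first non-empty value: env > TOML > Admin UI."""
    candidates = [
        ("environment", env_value),
        ("toml", toml_value),
        ("admin_ui", admin_value),
    ]
    for source, value in candidates:
        if value not in (None, ""):
            return value, source
    raise KeyError(f"No value configured for {name!r}")


# The environment value wins even though a TOML value is also present.
print(resolve_setting("tenant_id", env_value="from-env", toml_value="from-toml"))
```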
Configuration Parameters
Parameter | Type | Default | Description |
|---|---|---|---|
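auth_method | | | Authentication method to use for Azure connection (connectionstring or credentials) |
| string | | Azure Storage connection string (required for connectionstring authentication) |
storage_blob_url | string | | Azure Storage Blob service URL, e.g., https://mystorageaccount.blob.core.windows.net |
tenant_id | string | | Azure AD tenant ID (required for credentials authentication) |
client_id | string | | Azure AD client/application ID (required for credentials authentication) |
client_secret | secret string | | Azure AD client secret (required for credentials authentication) |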
Configuration via Diskover Admin
Navigate to Settings > Alternate Scanners > Azure Scanner
Select the authentication method (connectionstring or credentials)
Fill in the required fields for the chosen authentication method
Save the configuration
Environment Variable Overrides
Environment Variable | Configuration Field | Description |
|---|---|---|
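AZURE_STORAGE_CONNECTION_STRING | | Azure Storage account connection string |
AZURE_STORAGE_BLOB_URL | storage_blob_url | Azure Storage Blob endpoint URL |
AZURE_TENANT_ID | tenant_id | Azure AD tenant ID |
AZURE_CLIENT_ID | client_id | Azure AD application (client) ID |
AZURE_CLIENT_SECRET | client_secret | Azure AD client secret |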
Configuration Examples
Example 1: Connection String via Environment Variables
export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=abc123def456...;EndpointSuffix=core.windows.net"
cd /opt/diskover
python3 diskover.py --altscanner scandir_azure az://mycontainer
Example 2: Azure AD Credentials via Environment Variables
export AZURE_STORAGE_BLOB_URL="https://mystorageaccount.blob.core.windows.net"
export AZURE_TENANT_ID="12345678-1234-1234-1234-123456789abc"
export AZURE_CLIENT_ID="87654321-4321-4321-4321-cba987654321"
export AZURE_CLIENT_SECRET="your-secret-value-here"
cd /opt/diskover
python3 diskover.py --altscanner scandir_azure az://mycontainer
Example 3: TOML Configuration Files
Create or edit /opt/diskover/scanners/scandir_azure/settings.toml:
[scandir_azure]
auth_method = "credentials"
storage_blob_url = "https://mystorageaccount.blob.core.windows.net"
tenant_id = "12345678-1234-1234-1234-123456789abc"
client_id = "87654321-4321-4321-4321-cba987654321"
Store secrets separately in /opt/diskover/scanners/scandir_azure/.secrets.toml:
[scandir_azure]
client_secret = "your-secret-value-here"
Secure the secrets file:
chmod 600 /opt/diskover/scanners/scandir_azure/.secrets.toml
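To confirm the permissions took effect on Linux, you could check the file mode from Python. This is a small sketch with a hypothetical helper name (is_owner_only); it demonstrates the check on a temporary file rather than the real secrets file, and it relies on POSIX permission bits, so it is not meaningful on Windows.

```python
import os
import stat
import tempfile


def is_owner_only(path):
    """Return True when group and others have no permissions on the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


# Demonstrate on a throwaway file instead of the real secrets file.
with tempfile.NamedTemporaryFile() as tmp:
    os.chmod(tmp.name, 0o644)
    before = is_owner_only(tmp.name)
    os.chmod(tmp.name, 0o600)
    after = is_owner_only(tmp.name)

print(f"0644 owner-only: {before}, 0600 owner-only: {after}")
```

In practice you would call is_owner_only on the path to your .secrets.toml file.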
Usage / Execution
This is a standard alternate scanner that integrates with diskover.py via the --altscanner flag.
Basic Usage
Scan a specific Azure blob container:
Linux:
cd /opt/diskover
python3 diskover.py --altscanner scandir_azure az://mycontainer
Windows:
cd "C:\Program Files\Diskover"
python diskover.py --altscanner scandir_azure az://mycontainer
Scan All Containers
Omit the container name to automatically discover and scan all containers in the storage account:
python3 diskover.py --altscanner scandir_azure az://
This creates separate top paths for each container (e.g., /mycontainer1, /mycontainer2).
Scan a Specific Blob Prefix
Target only a specific "folder" (blob prefix) within a container:
python3 diskover.py --altscanner scandir_azure az://mycontainer/data/2024
With Custom Index Name
Specify a custom Elasticsearch index name:
python3 diskover.py -i diskover-azure-data --altscanner scandir_azure az://mycontainer
Multiple Containers in a Single Index
Scan multiple specific containers into one index:
python3 diskover.py -i diskover-azure-all --altscanner scandir_azure az://container1 az://container2 az://container3
With Verbose Logging
Enable debug logging for troubleshooting:
python3 diskover.py --altscanner scandir_azure --loglevel DEBUG az://mycontainer
With Parallel Worker Threads
Use Diskover's parallel crawling for faster indexing of large containers:
python3 diskover.py --altscanner scandir_azure --threads 8 az://mycontainer
Path Format Reference
Path Format | Description | Example |
|---|---|---|
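az:// | Scan all containers in the storage account | az:// |
az://<container> | Scan an entire container | az://mycontainer |
az://<container>/<prefix> | Scan a specific blob prefix (subfolder) | az://mycontainer/data/2024 |
az://<container>/<prefix>/ | Scan a prefix (trailing slash is optional) | az://mycontainer/data/2024/ |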
Index Path Mapping
Azure blob paths are converted to POSIX-style paths for indexing:
Azure Blob Path | Diskover Index Path |
|---|---|
|
|
|
|
|
|
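As an illustration of the az:// path formats and the POSIX-style index paths described above, here is a hedged Python sketch. The function names are hypothetical and the logic is inferred from the examples in this article (for instance, az://mycontainer appearing as /mycontainer in the index); the scanner's own parsing may differ.

```python
def parse_az_path(path):
    """Split 'az://container/prefix' into (container, prefix); both may be empty."""
    scheme = "az://"
    if not path.startswith(scheme):
        raise ValueError(f"Path must start with {scheme!r}: {path!r}")
    # Strip surrounding slashes so a trailing '/' does not change the result.
    rest = path[len(scheme):].strip("/")
    container, _, prefix = rest.partition("/")
    return container, prefix


def to_index_path(path):
    """Convert an az:// path to a POSIX-style path such as '/container/prefix'."""
    container, prefix = parse_az_path(path)
    return "/" + "/".join(part for part in (container, prefix) if part)


for example in ("az://mycontainer", "az://mycontainer/data/2024/"):
    print(example, "->", to_index_path(example))
```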
Integration with Index Tasks
When configuring the scanner as part of a Diskover Index Task, use the following settings:
Field | Value |
|---|---|
Alternate Scanner | scandir_azure |
Top Path | az://mycontainer |
Performance Tips
Thread count: Increase --threads for containers with many blobs spread across multiple virtual directories. Start with 4–8 threads and tune based on Azure API throttling behavior.
Prefix targeting: If you only need a subset of a large container, scan specific prefixes rather than the entire container to reduce scan time and API calls.
Network proximity: Running the scanner on an Azure VM in the same region as your storage account significantly reduces latency and scan time.
Multiple containers: When scanning many containers, using az:// for auto-discovery is convenient, but scanning individual containers with separate index tasks gives you more control over scheduling and error isolation.
Metadata Fields / Elasticsearch Mappings
The Azure scanner adds two custom metadata fields to every indexed document, capturing Azure-specific information that you can search and report on.
Field Mappings
Field Path | ES Type | Description |
|---|---|---|
azure_etag | keyword | Azure blob ETag: a unique identifier that changes when a blob is modified. Useful for change detection and versioning. |
azure_tier | keyword | Azure storage tier: Hot, Cool, or Archive |
Elasticsearch Mapping Definition
{
"mappings": {
"properties": {
"azure_etag": {
"type": "keyword"
},
"azure_tier": {
"type": "keyword"
}
}
}
}
Example Indexed Document (File)
{
"name": "financial_report_2024.xlsx",
"path": "/mycontainer/reports/financial/financial_report_2024.xlsx",
"path_parent": "/mycontainer/reports/financial",
"extension": "xlsx",
"size": 524288,
"size_du": 524288,
"mtime": "2024-06-15T14:30:00Z",
"atime": "2024-07-01T09:15:00Z",
"ctime": "2024-01-10T08:00:00Z",
"type": "file",
"azure_etag": "0x8DC1234567890AB",
"azure_tier": "Hot"
}
Example Indexed Document (Directory)
{
"name": "financial",
"path": "/mycontainer/reports/financial",
"path_parent": "/mycontainer/reports",
"type": "directory",
"azure_etag": "",
"azure_tier": ""
}
Note: Virtual directories (blob prefixes) do not have Azure ETags or tier information, so these fields will be empty for directory entries.
Searching in Diskover
Once Azure containers are indexed, you can use Diskover's standard search syntax to query Azure-specific metadata alongside regular file attributes. Here are practical search examples organized by common workflow.
Basic Tier Searches
Query | Description |
|---|---|
| Find all blobs in the Hot (most expensive) storage tier |
| Find all blobs in the Cool tier |
| Find all blobs in the Archive (cheapest) tier |
Cost Optimization Searches
Query | Description |
|---|---|
| Hot tier blobs larger than 1 GB — candidates for tiering down |
| Hot tier blobs not modified in over a year — likely should be Cool or Archive |
| Hot tier blobs over 100 MB and untouched for 90+ days |
| Cool tier blobs older than 180 days — candidates for Archive tier |
| Large video files in Hot storage — often good archive candidates |
Change Detection and Versioning
Query | Description |
|---|---|
| Find blobs with ETags matching a specific prefix pattern |
| Find all blobs that have an ETag (i.e., all files, excluding virtual directories) |
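Because an ETag changes whenever a blob is modified, comparing the azure_etag values from two scans is one way to spot changes. The sketch below assumes you have already exported each scan as a mapping of index path to ETag; the function name and sample values are hypothetical.

```python
def changed_paths(previous, current):
    """Compare two {path: etag} mappings and classify the differences."""
    added = sorted(p for p in current if p not in previous)
    removed = sorted(p for p in previous if p not in current)
    modified = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    return {"added": added, "removed": removed, "modified": modified}


# Hypothetical ETags from an earlier and a later scan of the same container.
earlier = {"/mycontainer/a.txt": "0x1", "/mycontainer/b.txt": "0x2"}
later = {"/mycontainer/a.txt": "0x1", "/mycontainer/b.txt": "0x3", "/mycontainer/c.txt": "0x4"}
print(changed_paths(earlier, later))
```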
File Type Analysis
Query | Description |
|---|---|
| Archived compressed files |
| Log files in Hot storage — often candidates for cleanup or archiving |
| Backup files in Hot storage — should typically be moved to Cool or Archive |
| Hot tier files specifically in a backups directory |
Hybrid / Cross-Platform Searches
Query | Description |
|---|---|
| Find large PDFs across all indexed sources (Azure + on-premises) |
| Recently modified files in any Azure container |
| Find files with "report" in the name that are stored in Azure (the |
Combining with Diskover Tags
Query | Description |
|---|---|
| Hot tier blobs marked for deletion |
| Archived blobs with a retention tag |
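If you generate searches from scripts, a small helper can assemble tier-based query strings. This sketch assumes Diskover accepts Elasticsearch query-string syntax (field:value clauses joined with AND, and size:>N for ranges); the function name tier_query is hypothetical, and the field names come from the indexed document examples earlier in this article.

```python
def tier_query(tier, min_size_bytes=None, extension=None):
    """Build a query string filtering on azure_tier plus optional extras."""
    if tier not in ("Hot", "Cool", "Archive"):
        raise ValueError(f"Unknown tier: {tier!r}")
    clauses = [f"azure_tier:{tier}"]
    if min_size_bytes is not None:
        clauses.append(f"size:>{int(min_size_bytes)}")
    if extension:
        clauses.append(f"extension:{extension}")
    return " AND ".join(clauses)


# Hot tier blobs larger than 1 GB (1024**3 bytes).
print(tier_query("Hot", min_size_bytes=1024**3))
```

Verify any generated query in the Diskover search bar before relying on it in reports or automation.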
Sample Diskover query output for Azure Tier = "Hot"
This shows all content in Azure Blob storage that is on the Hot tier, along with its ETag value.
Troubleshooting
Common Issues
Issue | Cause | Solution |
|---|---|---|
| Azure SDK packages not installed | Run python3 -m pip install azure-storage-blob azure-identity |
| Path doesn't start with az:// | Ensure the path uses the az:// prefix |
Scanner starts but finds 0 blobs | Container is empty or prefix doesn't match any blobs | Verify the container name and prefix; list blobs manually to confirm |
| No authentication credentials configured | Set environment variables or configure via Diskover Admin / TOML files |
|
| Check spelling in configuration; must be exactly connectionstring or credentials |
Authentication Failures
Symptom: Scanner fails to start with authentication or authorization errors.
Diagnosis:
Verify which auth method is configured:
Linux:
echo "Connection String: ${AZURE_STORAGE_CONNECTION_STRING:0:50}..."
echo "Blob URL: $AZURE_STORAGE_BLOB_URL"
echo "Tenant ID: $AZURE_TENANT_ID"
Windows:
Write-Output "Connection String: $($env:AZURE_STORAGE_CONNECTION_STRING.Substring(0,50))..."
Write-Output "Blob URL: $env:AZURE_STORAGE_BLOB_URL"
Write-Output "Tenant ID: $env:AZURE_TENANT_ID"
Test connection string validity (this checks the format only; no request is sent to Azure):
python3 -c "
from azure.storage.blob import BlobServiceClient
import os
try:
    client = BlobServiceClient.from_connection_string(os.getenv('AZURE_STORAGE_CONNECTION_STRING'))
    print('Connection string valid')
except Exception as e:
    print(f'Error: {e}')
"
Test Azure AD credentials:
python3 -c "
from azure.identity import ClientSecretCredential
import os
try:
    cred = ClientSecretCredential(
        os.getenv('AZURE_TENANT_ID'),
        os.getenv('AZURE_CLIENT_ID'),
        os.getenv('AZURE_CLIENT_SECRET')
    )
    token = cred.get_token('https://storage.azure.com/.default')
    print('Credentials valid, token obtained')
except Exception as e:
    print(f'Error: {e}')
"
Resolution:
Verify the account key in the connection string has not been regenerated (rotated) in the Azure Portal, and that any SAS token it contains has not expired
Ensure the Azure AD application has the Storage Blob Data Reader role
Check that the client secret has not expired
Verify tenant ID, client ID, and secret are correctly copied (no trailing whitespace)
Container Not Found
Symptom: Scanner reports the container is not found or access is denied for a specific container.
Diagnosis:
List available containers to confirm the name:
python3 -c "
from azure.storage.blob import BlobServiceClient
import os
client = BlobServiceClient.from_connection_string(os.getenv('AZURE_STORAGE_CONNECTION_STRING'))
for c in client.list_containers():
print(c.name)
"
Resolution:
Azure container names must be all lowercase (3 to 63 characters: letters, numbers, and hyphens), so verify the exact name and check for stray uppercase letters
Ensure the container exists in the correct storage account
For Azure AD auth, verify the role is assigned at the container or account level
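A quick local check of the container name can rule out typos before you look at permissions. The sketch below encodes Azure's published naming rules for containers (3 to 63 characters, lowercase letters, digits, and single hyphens, starting and ending with a letter or digit). The function name is hypothetical, and special containers such as $root or $web are deliberately not handled.

```python
import re

# One or more lowercase alphanumeric runs separated by single hyphens.
_CONTAINER_NAME = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_container_name(name):
    """Return True if `name` satisfies Azure's container naming rules."""
    return 3 <= len(name) <= 63 and _CONTAINER_NAME.fullmatch(name) is not None


for candidate in ("mycontainer", "MyContainer", "my--container"):
    print(candidate, is_valid_container_name(candidate))
```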
Network / Firewall Issues
Symptom: Connection timeouts or network unreachable errors when accessing Azure storage.
Diagnosis:
Test basic HTTPS connectivity:
Linux:
curl -I "https://mystorageaccount.blob.core.windows.net/"
Windows:
Invoke-WebRequest -Uri "https://mystorageaccount.blob.core.windows.net/" -Method Head
Check DNS resolution:
Linux:
nslookup mystorageaccount.blob.core.windows.net
Windows:
Resolve-DnsName mystorageaccount.blob.core.windows.net
Test port connectivity:
Linux:
nc -zv mystorageaccount.blob.core.windows.net 443
Windows:
Test-NetConnection -ComputerName mystorageaccount.blob.core.windows.net -Port 443
Resolution:
Verify firewall allows outbound HTTPS (port 443) to *.blob.core.windows.net
If using Azure Private Endpoints, ensure DNS is configured correctly
Configure a proxy if required by your network:
export HTTPS_PROXY="http://proxy.example.com:8080"
Check Azure Storage firewall settings allow your scanner host's IP address
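If curl or nc is not available on the scanner host, the same port test can be done from Python using only the standard library. This is a minimal sketch with hypothetical helper names; it opens a direct TCP connection, so it does not exercise an HTTPS_PROXY setting.

```python
import socket


def blob_endpoint_host(account_name, suffix="core.windows.net"):
    """Build the Blob service hostname for a storage account."""
    return f"{account_name}.blob.{suffix}"


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


host = blob_endpoint_host("mystorageaccount")
print(host)
```

Calling can_connect(host) from the scanner host then reports whether port 443 is reachable.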
Permission Denied on Blobs
Symptom: Scanner connects but fails to read certain blobs or list container contents.
Resolution:
For connection string auth: an account key grants full access, so confirm the connection string belongs to the correct storage account and that its key is current
For Azure AD auth: assign the Storage Blob Data Reader role at the account or container level
If using hierarchical namespace (ADLS Gen2), check container-level ACLs
If using a SAS token, verify it includes read and list permissions
Debug Logging
Enable detailed logging to diagnose any issue:
python3 /opt/diskover/diskover.py --altscanner scandir_azure --loglevel DEBUG az://mycontainer
Log file locations:
Linux:
/var/log/diskover/diskover.log
Windows: Check Diskover service logs or the configured log location
Support
Last Updated: April 2026