Azure Blob Storage

License: PRO (Professional Edition or higher)
Module Type: Alternate Scanner
Author: Diskover Data, Inc.

Overview / Use Cases

The Azure Blob Storage scanner brings your cloud storage into Diskover, giving you full search and discovery capabilities across your Azure blob containers — right alongside your on-premises file systems. Instead of manually browsing the Azure Portal or writing scripts to understand what's in your storage accounts, you can index everything into Diskover and search it just like any other data source.

The scanner translates Azure's flat blob namespace into the familiar hierarchical folder structure you're used to seeing in Diskover. It captures Azure-specific metadata like storage tiers and ETags, so you can build smart workflows around cost optimization and change tracking.

Here are some common scenarios where this scanner shines:

Cloud Migration Visibility — If you're migrating data into Azure, the scanner lets you index your containers after each migration phase to verify all expected files made it to their destination. You can create periodic snapshots during long-running migrations to track progress and catch stalled transfers early.
Storage Cost Optimization — Azure offers Hot, Cool, and Archive tiers at different price points. The scanner indexes tier information so you can search for old data sitting in expensive Hot storage, identify large blobs that should be archived, and verify that your Azure lifecycle policies are actually moving data as expected.
Hybrid Storage Management — When your data lives across on-premises NAS and Azure cloud, you need a single place to search everything. Index both your local file shares and Azure containers into Diskover for unified search and consolidated storage reporting — no more switching between tools to find what you need.
Compliance and Data Governance — Use Diskover's search and tagging capabilities combined with Azure metadata to track data residency, identify sensitive data stored in cloud containers, and generate audit-ready reports of your cloud storage footprint.

Sample metadata from Azure Blob scan:

Each file indexed by the Azure Blob scan includes the Tier and ETag value for the object. Virtual directories are indexed with these fields empty.

Understanding Azure Blob Storage

Azure Blob Storage is Microsoft's massively scalable object storage service designed for unstructured data — documents, images, videos, backups, logs, and virtually any file type. If you're new to Azure Blob Storage or want to understand how the scanner maps to Azure concepts, this section will help.

Key Concepts

Azure Concept	Description	Diskover Equivalent
Storage Account	Top-level Azure resource that contains all your blob data	Configured in scanner credentials
Container	A logical grouping of blobs (similar to a root folder)	Appears as the top-level directory in the index (e.g., `/mycontainer`)
Blob	An individual file stored in a container	Indexed as a file entry in Diskover
Virtual Directory	Blob name prefixes separated by `/` that simulate folders	Indexed as directories in Diskover
Storage Tier	Cost/performance class: Hot, Cool, or Archive	Indexed as `azure_tier` for search and reporting
ETag	A unique identifier that changes when a blob is modified	Indexed as `azure_etag` for change detection
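To illustrate how a flat blob namespace turns into folders, here is a minimal sketch that derives the virtual directories implied by a list of blob names. The function name `virtual_directories` and the sample blob names are hypothetical, and this is not the scanner's actual implementation.

```python
from pathlib import PurePosixPath


def virtual_directories(container, blob_names):
    """Return the sorted index paths of every directory implied by the blob names."""
    dirs = {f"/{container}"}  # the container itself is the top-level directory
    for name in blob_names:
        # Every parent prefix of a blob name acts as a virtual directory.
        for parent in PurePosixPath(name).parents:
            if str(parent) != ".":
                dirs.add(f"/{container}/{parent}")
    return sorted(dirs)


# Hypothetical blob names, as they would appear in a flat listing.
blobs = ["reports/financial/q1.xlsx", "reports/summary.pdf", "readme.txt"]
print(virtual_directories("mycontainer", blobs))
```

No folder objects exist in the listing itself; each directory is inferred purely from the `/` separators in the blob names.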

Storage Tiers

Understanding storage tiers is important because they directly impact your costs. The scanner captures tier information so you can analyze and optimize your spending:

Tier	Best For	Access Cost	Storage Cost
Hot	Frequently accessed data	Lowest	Highest
Cool	Infrequently accessed (30+ days)	Moderate	Lower
Archive	Rarely accessed (180+ days)	Highest (retrieval fees + latency)	Lowest
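As a rough illustration of the table above, the sketch below maps the number of days since a blob was last accessed to the tier the table suggests. The function name `suggest_tier` is hypothetical and the thresholds are taken directly from the table. Real tiering decisions should also weigh retrieval fees and latency.

```python
def suggest_tier(days_since_last_access):
    """Return the tier suggested by the 30-day and 180-day thresholds in the table."""
    if days_since_last_access >= 180:
        return "Archive"
    if days_since_last_access >= 30:
        return "Cool"
    return "Hot"


for days in (7, 45, 400):
    print(days, suggest_tier(days))
```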

Authentication Methods

The scanner supports two ways to connect to Azure. Choose the one that fits your organization's security posture:

Method	Best For	Setup Complexity	Security
Connection String	Quick setup, dev/test, simple deployments	Low — copy from Azure Portal	Moderate — contains account key
Azure AD Service Principal	Enterprise environments with centralized identity	Moderate — requires app registration	High — role-based, secret rotation supported

Tip: For production deployments, Azure AD credentials are recommended because they let you assign the minimum required role (Storage Blob Data Reader) and support secret rotation without regenerating storage account keys.

Requirements

System Requirements

Component	Requirement
Python	3.9 or higher
Diskover	Core installation with alternate scanner support
Network	HTTPS access to Azure Blob Storage endpoints (`*.blob.core.windows.net`) on port 443

Python Dependencies

Package	Version	Purpose
`azure-storage-blob`	12.x	Azure Blob Storage SDK for blob enumeration and property access
`azure-identity`	1.x	Azure AD authentication via service principal credentials

Azure Prerequisites

Requirement	Description
Storage Account	Azure Storage account with Blob Storage enabled
Authentication	Connection string OR Azure AD service principal with storage access
Container Access	Read access to blob containers — Storage Blob Data Reader role at minimum
Network Access	Firewall / VNet rules must permit access from the Diskover scanner host

Permissions

For connection string authentication:

Obtain the connection string from Azure Portal > Storage Account > Security + networking > Access Keys

For Azure AD credentials authentication:

Create an Azure AD (Entra ID) application registration
Create a client secret for the application
Grant the application the Storage Blob Data Reader role on the storage account
Record the Tenant ID, Client ID, and Client Secret

Installation

Step 1: Install Scanner Package

Linux:

dnf install diskover-scanner-azure

Windows:

The scanner files are included with the Diskover Windows installation. No separate installation step is required.

Install locations:

Linux: /opt/diskover/scanners/scandir_azure/
Windows: C:\Program Files\Diskover\scanners\scandir_azure\

Step 2: Install Python Dependencies

Install the required Azure SDK packages:

Linux:

python3 -m pip install azure-storage-blob azure-identity

Windows:

python -m pip install azure-storage-blob azure-identity

Or install specific versions for compatibility:

python3 -m pip install -r /opt/diskover/scanners/scandir_azure/requirements.txt

Step 3: Verify Installation

Confirm the Azure packages are installed correctly:

python3 -c "from azure.storage.blob import BlobServiceClient; print('azure-storage-blob: OK')"
python3 -c "from azure.identity import ClientSecretCredential; print('azure-identity: OK')"

Check installed versions:

python3 -m pip show azure-storage-blob azure-identity | grep -E "^(Name|Version)"

Step 4: Configure Authentication

Choose one of the two authentication methods below. Authentication can be configured via environment variables, TOML configuration files, or the Diskover Admin UI (see the Configuration section for full details).

Option A: Connection String Authentication

Linux:

export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=abc123...;EndpointSuffix=core.windows.net"

Windows:

$env:AZURE_STORAGE_CONNECTION_STRING = "DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=abc123...;EndpointSuffix=core.windows.net"
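Before exporting the value, it can help to sanity-check that the connection string was copied in full. The sketch below splits a connection string into its settings and reports missing keys. The helper names are hypothetical, and the required keys assume an account-key style string like the one shown above (SAS-based strings use different keys).

```python
def parse_connection_string(conn_str):
    """Split a connection string into a dict of its key/value settings."""
    settings = {}
    for part in conn_str.split(";"):
        if not part.strip():
            continue  # tolerate a trailing semicolon
        # Split on the first '=' only: base64 account keys often end in '='.
        key, _, value = part.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def missing_keys(conn_str, required=("AccountName", "AccountKey")):
    """Return the required keys that are absent or empty (assumed key list)."""
    settings = parse_connection_string(conn_str)
    return [key for key in required if not settings.get(key)]


# Placeholder value for illustration only, not a real key.
sample = "DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=abc123==;EndpointSuffix=core.windows.net"
print(missing_keys(sample))
```

An empty list means the expected keys are present. It does not prove that the key itself is valid.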

Option B: Azure AD Credentials Authentication

Linux:

export AZURE_STORAGE_BLOB_URL="https://mystorageaccount.blob.core.windows.net"
export AZURE_TENANT_ID="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
export AZURE_CLIENT_ID="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
export AZURE_CLIENT_SECRET="your-client-secret-value"

Windows:

$env:AZURE_STORAGE_BLOB_URL = "https://mystorageaccount.blob.core.windows.net"
$env:AZURE_TENANT_ID = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
$env:AZURE_CLIENT_ID = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
$env:AZURE_CLIENT_SECRET = "your-client-secret-value"

Step 5: Verify Azure Connectivity

Test connectivity to your Azure storage account:

python3 -c "
from azure.storage.blob import BlobServiceClient
import os
conn_str = os.getenv('AZURE_STORAGE_CONNECTION_STRING')
client = BlobServiceClient.from_connection_string(conn_str)
containers = list(client.list_containers())
print(f'Found {len(containers)} containers')
for c in containers[:5]:
    print(f'  - {c.name}')
"

If you're using Azure AD credentials, adapt the test to use ClientSecretCredential instead.

Sample output of the Azure connectivity test:

Found 2 containers
  - blob-acme-corp
  - blob-storage-diskover
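For Azure AD credentials, the same connectivity test can be adapted roughly as follows. This is a sketch rather than part of the scanner: the helper names `missing_vars` and `list_container_names` are hypothetical, and it assumes the four environment variables from Option B are set.

```python
import os
from itertools import islice

REQUIRED_VARS = (
    "AZURE_STORAGE_BLOB_URL",
    "AZURE_TENANT_ID",
    "AZURE_CLIENT_ID",
    "AZURE_CLIENT_SECRET",
)


def missing_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def list_container_names(env=None, limit=5):
    """Authenticate as the service principal and return up to `limit` container names."""
    # Imported lazily so missing_vars() still works before the SDK is installed.
    from azure.identity import ClientSecretCredential
    from azure.storage.blob import BlobServiceClient

    env = os.environ if env is None else env
    credential = ClientSecretCredential(
        tenant_id=env["AZURE_TENANT_ID"],
        client_id=env["AZURE_CLIENT_ID"],
        client_secret=env["AZURE_CLIENT_SECRET"],
    )
    client = BlobServiceClient(
        account_url=env["AZURE_STORAGE_BLOB_URL"], credential=credential
    )
    return [c.name for c in islice(client.list_containers(), limit)]
```

Call `missing_vars()` first. If it returns an empty list, `list_container_names()` makes a real authenticated request, so a failure there points to the credentials, the role assignment, or the network rather than to missing configuration.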

Configuration

Configuration is managed through the Diskover Admin UI, environment variables, or TOML configuration files. The scanner loads configuration from multiple sources in this order of precedence:

Environment Variables — Highest priority, overrides all other sources
TOML Configuration Files — settings.toml and .secrets.toml in the scanner directory
Diskover Admin UI — API-sourced configuration when available
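The precedence order above can be pictured as a layered merge. The sketch below is illustrative only: `resolve_config` is a hypothetical helper, the environment variable names come from the Environment Variable Overrides table, and treating an empty environment variable as "not set" is an assumption.

```python
ENV_TO_FIELD = {
    "AZURE_STORAGE_CONNECTION_STRING": "connection_string",
    "AZURE_STORAGE_BLOB_URL": "storage_blob_url",
    "AZURE_TENANT_ID": "tenant_id",
    "AZURE_CLIENT_ID": "client_id",
    "AZURE_CLIENT_SECRET": "client_secret",
}


def resolve_config(admin_cfg, toml_cfg, env):
    """Merge settings so that higher-priority sources override lower ones."""
    resolved = dict(admin_cfg)       # lowest priority: Diskover Admin UI
    resolved.update(toml_cfg)        # TOML files override the Admin UI
    for var, field in ENV_TO_FIELD.items():
        if env.get(var):             # environment variables override everything
            resolved[field] = env[var]
    return resolved


admin = {"auth_method": "credentials", "tenant_id": "from-admin"}
toml = {"tenant_id": "from-toml", "client_id": "from-toml"}
env = {"AZURE_CLIENT_ID": "from-env"}
print(resolve_config(admin, toml, env))
```

In this example `tenant_id` comes from the TOML file and `client_id` from the environment, while `auth_method` falls through from the Admin UI.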

Configuration Parameters

Parameter	Type	Default	Description
`auth_method`	`connectionstring` or `credentials`	`connectionstring`	Authentication method to use for Azure connection
`connection_string`	string	`''`	Azure Storage connection string (required for `connectionstring` auth)
`storage_blob_url`	string	`''`	Azure Storage Blob service URL, e.g., `https://account.blob.core.windows.net` (required for `credentials` auth)
`tenant_id`	string	`''`	Azure AD tenant ID (required for `credentials` auth)
`client_id`	string	`''`	Azure AD client/application ID (required for `credentials` auth)
`client_secret`	secret string	`''`	Azure AD client secret (required for `credentials` auth)
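To make the "required for" notes in the table concrete, here is a small validation sketch. `validate_config` is a hypothetical helper and its messages are made up for illustration; they are not the scanner's own error strings.

```python
REQUIRED_FIELDS = {
    "connectionstring": ("connection_string",),
    "credentials": ("storage_blob_url", "tenant_id", "client_id", "client_secret"),
}


def validate_config(cfg):
    """Return a list of problems; an empty list means the config looks complete."""
    method = cfg.get("auth_method", "connectionstring")  # documented default
    if method not in REQUIRED_FIELDS:
        return [f"auth_method must be one of {sorted(REQUIRED_FIELDS)}, got {method!r}"]
    return [
        f"{field} is required for {method} auth"
        for field in REQUIRED_FIELDS[method]
        if not cfg.get(field)
    ]


print(validate_config({"auth_method": "credentials", "tenant_id": "abc"}))
```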

Configuration via Diskover Admin

Navigate to Settings > Alternate Scanners > Azure Scanner
Select the authentication method (connectionstring or credentials)
Fill in the required fields for the chosen authentication method
Save the configuration

Environment Variable Overrides

Environment Variable	Configuration Field	Description
`AZURE_STORAGE_CONNECTION_STRING`	`connection_string`	Azure Storage account connection string
`AZURE_STORAGE_BLOB_URL`	`storage_blob_url`	Azure Storage Blob endpoint URL
`AZURE_TENANT_ID`	`tenant_id`	Azure AD tenant ID
`AZURE_CLIENT_ID`	`client_id`	Azure AD application (client) ID
`AZURE_CLIENT_SECRET`	`client_secret`	Azure AD client secret

Configuration Examples

Example 1: Connection String via Environment Variables

export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=abc123def456...;EndpointSuffix=core.windows.net"

cd /opt/diskover
python3 diskover.py --altscanner scandir_azure az://mycontainer

Example 2: Azure AD Credentials via Environment Variables

export AZURE_STORAGE_BLOB_URL="https://mystorageaccount.blob.core.windows.net"
export AZURE_TENANT_ID="12345678-1234-1234-1234-123456789abc"
export AZURE_CLIENT_ID="87654321-4321-4321-4321-cba987654321"
export AZURE_CLIENT_SECRET="your-secret-value-here"

cd /opt/diskover
python3 diskover.py --altscanner scandir_azure az://mycontainer

Example 3: TOML Configuration Files

Create or edit /opt/diskover/scanners/scandir_azure/settings.toml:

[scandir_azure]
auth_method = "credentials"
storage_blob_url = "https://mystorageaccount.blob.core.windows.net"
tenant_id = "12345678-1234-1234-1234-123456789abc"
client_id = "87654321-4321-4321-4321-cba987654321"

Store secrets separately in /opt/diskover/scanners/scandir_azure/.secrets.toml:

[scandir_azure]
client_secret = "your-secret-value-here"

Secure the secrets file:

chmod 600 /opt/diskover/scanners/scandir_azure/.secrets.toml

Usage / Execution

This is a standard alternate scanner that integrates with diskover.py via the --altscanner flag.

Basic Usage

Scan a specific Azure blob container:

Linux:

cd /opt/diskover
python3 diskover.py --altscanner scandir_azure az://mycontainer

Windows:

cd "C:\Program Files\Diskover"
python diskover.py --altscanner scandir_azure az://mycontainer

Scan All Containers

Omit the container name to automatically discover and scan all containers in the storage account:

python3 diskover.py --altscanner scandir_azure az://

This creates separate top paths for each container (e.g., /mycontainer1, /mycontainer2).

Scan a Specific Blob Prefix

Target only a specific "folder" (blob prefix) within a container:

python3 diskover.py --altscanner scandir_azure az://mycontainer/data/2024

With Custom Index Name

Specify a custom Elasticsearch index name:

python3 diskover.py -i diskover-azure-data --altscanner scandir_azure az://mycontainer

Multiple Containers in a Single Index

Scan multiple specific containers into one index:

python3 diskover.py -i diskover-azure-all --altscanner scandir_azure az://container1 az://container2 az://container3

With Verbose Logging

Enable debug logging for troubleshooting:

python3 diskover.py --altscanner scandir_azure --loglevel DEBUG az://mycontainer

With Parallel Worker Threads

Use Diskover's parallel crawling for faster indexing of large containers:

python3 diskover.py --altscanner scandir_azure --threads 8 az://mycontainer

Path Format Reference

Path Format	Description	Example
`az://`	Scan all containers in the storage account	`az://`
`az://<container>`	Scan an entire container	`az://mycontainer`
`az://<container>/<prefix>`	Scan a specific blob prefix (subfolder)	`az://mycontainer/data/2024`
`az://<container>/<prefix>/`	Scan a prefix (trailing slash is optional)	`az://mycontainer/archives/`

Index Path Mapping

Azure blob paths are converted to POSIX-style paths for indexing:

Azure Blob Path	Diskover Index Path
`az://mycontainer`	`/mycontainer`
`az://mycontainer/folder/file.txt`	`/mycontainer/folder/file.txt`
`az://mycontainer/deep/nested/path/doc.pdf`	`/mycontainer/deep/nested/path/doc.pdf`
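The two tables above can be summarized in a few lines of Python. This sketch uses hypothetical helpers (`parse_az_path`, `index_path`) to show one way the `az://` format could be split and mapped. It is not the scanner's own parsing code.

```python
def parse_az_path(az_path):
    """Split 'az://container/prefix' into (container, prefix); either may be empty."""
    scheme = "az://"
    if not az_path.startswith(scheme):
        raise ValueError(f"path must start with {scheme!r}: {az_path!r}")
    rest = az_path[len(scheme):].strip("/")  # the trailing slash is optional
    container, _, prefix = rest.partition("/")
    return container, prefix


def index_path(az_path):
    """Return the POSIX-style index path for an az:// path that names a container."""
    container, prefix = parse_az_path(az_path)
    return "/" + "/".join(part for part in (container, prefix) if part)


print(parse_az_path("az://mycontainer/archives/"))
print(index_path("az://mycontainer/folder/file.txt"))
```

An empty container name, as returned for `az://`, corresponds to the "scan all containers" case, where each discovered container gets its own top path.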

Integration with Index Tasks

When configuring the scanner as part of a Diskover Index Task, use the following settings:

Field	Value
Alternate Scanner	`scandir_azure`
Top Path	`az://mycontainer` (or `az://` for all containers)

Performance Tips

Thread count: Increase --threads for containers with many blobs spread across multiple virtual directories. Start with 4–8 threads and tune based on Azure API throttling behavior.
Prefix targeting: If you only need a subset of a large container, scan specific prefixes rather than the entire container to reduce scan time and API calls.
Network proximity: Running the scanner on an Azure VM in the same region as your storage account significantly reduces latency and scan time.
Multiple containers: When scanning many containers, using az:// for auto-discovery is convenient, but scanning individual containers with separate index tasks gives you more control over scheduling and error isolation.

Metadata Fields / Elasticsearch Mappings

The Azure scanner adds two custom metadata fields to every indexed document, capturing Azure-specific information that you can search and report on.

Field Mappings

Field Path	ES Type	Description
`azure_etag`	`keyword`	Azure blob ETag — a unique identifier that changes when a blob is modified. Useful for change detection and versioning.
`azure_tier`	`keyword`	Azure storage tier — `Hot`, `Cool`, or `Archive`. Used for cost analysis and tier optimization workflows.

Elasticsearch Mapping Definition

{
  "mappings": {
    "properties": {
      "azure_etag": {
        "type": "keyword"
      },
      "azure_tier": {
        "type": "keyword"
      }
    }
  }
}

Example Indexed Document (File)

{
  "name": "financial_report_2024.xlsx",
  "path": "/mycontainer/reports/financial/financial_report_2024.xlsx",
  "path_parent": "/mycontainer/reports/financial",
  "extension": "xlsx",
  "size": 524288,
  "size_du": 524288,
  "mtime": "2024-06-15T14:30:00Z",
  "atime": "2024-07-01T09:15:00Z",
  "ctime": "2024-01-10T08:00:00Z",
  "type": "file",
  "azure_etag": "0x8DC1234567890AB",
  "azure_tier": "Hot"
}

Example Indexed Document (Directory)

{
  "name": "financial",
  "path": "/mycontainer/reports/financial",
  "path_parent": "/mycontainer/reports",
  "type": "directory",
  "azure_etag": "",
  "azure_tier": ""
}

Note: Virtual directories (blob prefixes) do not have Azure ETags or tier information, so these fields will be empty for directory entries.
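As an example of what these fields enable, the sketch below totals file sizes per tier from a list of documents shaped like the examples above. It assumes you have already retrieved the documents as Python dictionaries. `size_by_tier` and the sample data are hypothetical.

```python
from collections import defaultdict


def size_by_tier(docs):
    """Sum the 'size' of file documents for each azure_tier value."""
    totals = defaultdict(int)
    for doc in docs:
        # Directory entries carry an empty tier, so they are skipped.
        if doc.get("type") == "file" and doc.get("azure_tier"):
            totals[doc["azure_tier"]] += doc.get("size", 0)
    return dict(totals)


docs = [
    {"type": "file", "size": 524288, "azure_tier": "Hot"},
    {"type": "file", "size": 1048576, "azure_tier": "Hot"},
    {"type": "file", "size": 2048, "azure_tier": "Archive"},
    {"type": "directory", "azure_tier": ""},
]
print(size_by_tier(docs))  # {'Hot': 1572864, 'Archive': 2048}
```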

Searching in Diskover

Once Azure containers are indexed, you can use Diskover's standard search syntax to query Azure-specific metadata alongside regular file attributes. Here are practical search examples organized by common workflow.

Basic Tier Searches

Query	Description
`azure_tier:Hot`	Find all blobs in the Hot (most expensive) storage tier
`azure_tier:Cool`	Find all blobs in the Cool tier
`azure_tier:Archive`	Find all blobs in the Archive (cheapest) tier

Cost Optimization Searches

Query	Description
`azure_tier:Hot AND size:>1073741824`	Hot tier blobs larger than 1 GB — candidates for tiering down
`azure_tier:Hot AND mtime:<now-365d`	Hot tier blobs not modified in over a year — likely should be Cool or Archive
`azure_tier:Hot AND mtime:<now-90d AND size:>104857600`	Hot tier blobs over 100 MB and untouched for 90+ days
`azure_tier:Cool AND mtime:<now-180d`	Cool tier blobs older than 180 days — candidates for Archive tier
`azure_tier:Hot AND extension:(mp4 OR mov OR avi)`	Video files in Hot storage, which are often large and good archive candidates
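The byte values in these queries are easy to mistype, so you may prefer to generate them. The sketch below composes query strings in the same syntax as the table. `tier_query` is a hypothetical helper, not part of Diskover.

```python
GIB = 1024 ** 3  # 1 GB as used in the table: 1073741824 bytes
MIB = 1024 ** 2  # 1 MB: 1048576 bytes


def tier_query(tier, min_size_bytes=None, older_than_days=None):
    """Compose a search string from a tier plus optional age and size filters."""
    clauses = [f"azure_tier:{tier}"]
    if older_than_days is not None:
        clauses.append(f"mtime:<now-{older_than_days}d")
    if min_size_bytes is not None:
        clauses.append(f"size:>{min_size_bytes}")
    return " AND ".join(clauses)


print(tier_query("Hot", min_size_bytes=GIB))
print(tier_query("Hot", min_size_bytes=100 * MIB, older_than_days=90))
```

The two calls reproduce the first and third queries in the table.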

Change Detection and Versioning

Query	Description
`azure_etag:0x8DC*`	Find blobs with ETags matching a specific prefix pattern
`azure_etag:*`	Find all blobs that have an ETag (i.e., all files, excluding virtual directories)

File Type Analysis

Query	Description
`azure_tier:Archive AND (extension:zip OR extension:tar OR extension:gz)`	Archived compressed files
`azure_tier:Hot AND extension:log`	Log files in Hot storage — often candidates for cleanup or archiving
`azure_tier:Hot AND extension:bak`	Backup files in Hot storage — should typically be moved to Cool or Archive
`path_parent:\/mycontainer\/backups AND azure_tier:Hot`	Hot tier files specifically in a backups directory

Hybrid / Cross-Platform Searches

Query	Description
`extension:pdf AND size:>10485760`	Find large PDFs across all indexed sources (Azure + on-premises)
`path_parent:\/mycontainer* AND mtime:>now-7d`	Files modified in the last 7 days under paths beginning with `/mycontainer`
`name:report AND azure_tier:*`	Find files with "report" in the name that are stored in Azure (the `azure_tier:*` filter limits results to Azure-indexed data)

Combining with Diskover Tags

Query	Description
`azure_tier:Hot AND tags:delete`	Hot tier blobs marked for deletion
`azure_tier:Archive AND tags:retain`	Archived blobs with a retention tag

Sample Diskover query output for Azure Tier = “Hot”

This shows all content in Azure Blob Storage that is on the Hot tier, along with its ETag value.

Troubleshooting

Common Issues

Issue	Cause	Solution
`ImportError: No module named 'azure'`	Azure SDK packages not installed	Run `python3 -m pip install azure-storage-blob azure-identity`
`Invalid tree_dir arg for Azure Storage Blob scanner`	Path doesn't start with `az://`	Ensure path uses the `az://` prefix (e.g., `az://mycontainer`)
Scanner starts but finds 0 blobs	Container is empty or prefix doesn't match any blobs	Verify the container name and prefix; list blobs manually to confirm
`connectionstring or storagebloburl not set in config`	No authentication credentials configured	Set environment variables or configure via Diskover Admin / TOML files
`invalid authmethod set in config`	`auth_method` is not `connectionstring` or `credentials`	Check spelling in configuration — must be exactly `connectionstring` or `credentials`

Authentication Failures

Symptom: Scanner fails to start with authentication or authorization errors.

Diagnosis:

Verify which auth method is configured:

Linux:

echo "Connection String: ${AZURE_STORAGE_CONNECTION_STRING:0:50}..."
echo "Blob URL: $AZURE_STORAGE_BLOB_URL"
echo "Tenant ID: $AZURE_TENANT_ID"

Windows:

Write-Output "Connection String: $($env:AZURE_STORAGE_CONNECTION_STRING.Substring(0,50))..."
Write-Output "Blob URL: $env:AZURE_STORAGE_BLOB_URL"
Write-Output "Tenant ID: $env:AZURE_TENANT_ID"

Test connection string validity:

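python3 -c "
from azure.storage.blob import BlobServiceClient
import os
try:
    client = BlobServiceClient.from_connection_string(os.getenv('AZURE_STORAGE_CONNECTION_STRING'))
    # from_connection_string only parses the string, so list one container to make a real authenticated request
    next(iter(client.list_containers()), None)
    print('Connection string valid')
except Exception as e:
    print(f'Error: {e}')
"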

Test Azure AD credentials:

python3 -c "
from azure.identity import ClientSecretCredential
import os
try:
    cred = ClientSecretCredential(
        os.getenv('AZURE_TENANT_ID'),
        os.getenv('AZURE_CLIENT_ID'),
        os.getenv('AZURE_CLIENT_SECRET')
    )
    token = cred.get_token('https://storage.azure.com/.default')
    print('Credentials valid, token obtained')
except Exception as e:
    print(f'Error: {e}')
"

Resolution:

Verify the account key in the connection string has not been regenerated in the Azure Portal (and, for SAS-based strings, that the token has not expired)
Ensure the Azure AD application has the Storage Blob Data Reader role
Check that the client secret has not expired
Verify tenant ID, client ID, and secret are correctly copied (no trailing whitespace)
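Copy-and-paste mistakes can be caught before contacting Azure at all. The following sketch flags empty values, stray whitespace, and malformed IDs. `check_credentials` is a hypothetical helper, and it assumes the tenant and client IDs are GUIDs as in the examples in this document.

```python
import uuid


def check_credentials(tenant_id, client_id, client_secret):
    """Return a list of likely copy-and-paste problems; empty means none found."""
    problems = []
    values = (
        ("tenant_id", tenant_id),
        ("client_id", client_id),
        ("client_secret", client_secret),
    )
    for label, value in values:
        if not value:
            problems.append(f"{label} is empty")
        elif value != value.strip():
            problems.append(f"{label} has leading or trailing whitespace")
    # Assumption: both IDs are GUIDs, as in the examples in this document.
    for label, value in values[:2]:
        if value:
            try:
                uuid.UUID(value.strip())
            except ValueError:
                problems.append(f"{label} is not a valid GUID")
    return problems


print(check_credentials(
    "12345678-1234-1234-1234-123456789abc",
    "87654321-4321-4321-4321-cba987654321",
    "your-secret-value-here ",
))
```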

Container Not Found

Symptom: Scanner reports the container is not found or access is denied for a specific container.

Diagnosis:

List available containers to confirm the name:

python3 -c "
from azure.storage.blob import BlobServiceClient
import os
client = BlobServiceClient.from_connection_string(os.getenv('AZURE_STORAGE_CONNECTION_STRING'))
for c in client.list_containers():
    print(c.name)
"

Resolution:

Azure container names must be all lowercase (letters, numbers, and hyphens only), so verify the exact name
Ensure the container exists in the correct storage account
For Azure AD auth, verify the role is assigned at the container or account level

Network / Firewall Issues

Symptom: Connection timeouts or network unreachable errors when accessing Azure storage.

Diagnosis:

Test basic HTTPS connectivity:

Linux:

curl -I "https://mystorageaccount.blob.core.windows.net/"

Windows:

Invoke-WebRequest -Uri "https://mystorageaccount.blob.core.windows.net/" -Method Head

Check DNS resolution:

Linux:

nslookup mystorageaccount.blob.core.windows.net

Windows:

Resolve-DnsName mystorageaccount.blob.core.windows.net

Test port connectivity:

Linux:

nc -zv mystorageaccount.blob.core.windows.net 443

Windows:

Test-NetConnection -ComputerName mystorageaccount.blob.core.windows.net -Port 443

Resolution:

Verify firewall allows outbound HTTPS (port 443) to *.blob.core.windows.net
If using Azure Private Endpoints, ensure DNS is configured correctly
Configure a proxy if required by your network: export HTTPS_PROXY="http://proxy.example.com:8080"
Check Azure Storage firewall settings allow your scanner host's IP address

Permission Denied on Blobs

Symptom: Scanner connects but fails to read certain blobs or list container contents.

Resolution:

For connection string auth: verify the key has read and list permissions
For Azure AD auth: assign the Storage Blob Data Reader role at the account or container level
If using hierarchical namespace (ADLS Gen2), check container-level ACLs
If using a SAS token, verify it includes read and list permissions

Debug Logging

Enable detailed logging to diagnose any issue:

python3 /opt/diskover/diskover.py --altscanner scandir_azure --loglevel DEBUG az://mycontainer

Log file locations:

Linux: /var/log/diskover/diskover.log
Windows: Check Diskover service logs or the configured log location

Support

Last Updated: April 2026
