AWS S3 & S3 Compatible

License: PRO (Professional Edition or higher)
Module Type: Alternate Scanner
Author: Diskover Data, Inc.

Overview

The Diskover S3 Scanner brings your Amazon S3 cloud storage into Diskover, letting you search, browse, and analyze objects stored across S3 buckets just like you would with on-premises files. Instead of logging into the AWS console and clicking through bucket after bucket, you get a single, unified view of all your cloud data — complete with S3-specific metadata like storage classes and ETags.

Whether you're tracking down a specific file buried in a multi-terabyte data lake, figuring out which buckets are costing you the most, or building an inventory of assets across your entire AWS account, this scanner gives you the visibility you need.

The scanner also works with S3-compatible object stores like MinIO, Wasabi, DigitalOcean Spaces, Backblaze B2, and Cloudflare R2 — so you're not limited to AWS.

Use Cases

Data Discovery and Cataloging
Organizations with data distributed across multiple S3 buckets often struggle to know what's where. The S3 scanner indexes all accessible buckets in a single operation, giving teams a unified, searchable view of cloud storage assets. You can locate specific files across buckets in seconds rather than manually browsing through the S3 console — making it easy to build comprehensive inventories for licensing, rights management, or data governance initiatives.

Cost Optimization
S3 offers multiple storage tiers with dramatically different price points, and understanding how your data is distributed across those tiers is the first step to reducing costs. By indexing storage class metadata alongside file age and size, you can quickly identify candidates for tier transitions — for example, old STANDARD-class data that should be moved to STANDARD_IA or GLACIER. Track tier distribution over time to measure the impact of your optimization efforts.

Sample storage tier analysis using metadata from the S3 scanner:

The S3 scanner reports, for each storage tier, how many files it holds and how much storage they consume.
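A per-tier breakdown like this can be approximated from indexed documents with a few lines of Python. This is a minimal sketch rather than the scanner's own reporting code: it assumes each document carries the `size` and `s3_storageclass` fields described under Metadata Fields, and the function name and sample documents are made up for illustration.

```python
from collections import defaultdict


def summarize_by_storage_class(docs):
    """Return {storage_class: {"count": n, "bytes": total}} for indexed docs."""
    summary = defaultdict(lambda: {"count": 0, "bytes": 0})
    for doc in docs:
        tier = doc.get("s3_storageclass", "UNKNOWN")
        summary[tier]["count"] += 1
        summary[tier]["bytes"] += doc.get("size", 0)
    return dict(summary)


# Hypothetical documents standing in for search results.
sample_docs = [
    {"name": "a.mp4", "size": 1000, "s3_storageclass": "STANDARD"},
    {"name": "b.mp4", "size": 3000, "s3_storageclass": "STANDARD"},
    {"name": "c.tar", "size": 500, "s3_storageclass": "GLACIER"},
]

for tier, stats in sorted(summarize_by_storage_class(sample_docs).items()):
    print(f"{tier}: {stats['count']} objects, {stats['bytes']} bytes")
```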

Understanding Amazon S3

Amazon S3 (Simple Storage Service) organizes data into buckets, which are top-level containers that hold objects (your files). While S3 is technically a flat key-value store, it uses forward slashes (/) in object keys to simulate a folder hierarchy — and the Diskover S3 scanner translates this into the familiar directory structure you see in the Diskover Web UI.

Storage Classes

One of the most important pieces of metadata the scanner captures is the storage class, which determines both how quickly you can access an object and how much it costs to store. Here's a quick reference:

Storage Class	Access Speed	Use Case	Relative Cost
`STANDARD`	Milliseconds	Frequently accessed data	Highest
`INTELLIGENT_TIERING`	Milliseconds	Unpredictable access patterns	Auto-optimized
`STANDARD_IA`	Milliseconds	Infrequent access, rapid retrieval needed	Lower
`ONEZONE_IA`	Milliseconds	Infrequent access, single-AZ durability acceptable	Lower still
`GLACIER_IR`	Milliseconds	Archive with instant retrieval	Archive tier
`GLACIER`	Minutes to hours	Long-term archive	Low
`DEEP_ARCHIVE`	Up to 12 hours	Rarely accessed, lowest cost	Lowest

Understanding these tiers matters because Diskover lets you search by storage class, so you can find expensive STANDARD-class objects that haven't been modified in months and may be candidates for a cheaper tier. (S3 listings report a last-modified time but no last-access time, so age is measured from mtime.)

ETags

The scanner also captures ETags (entity tags), which are identifiers assigned to each object. For objects uploaded in a single operation, the ETag is typically the MD5 hash of the content — useful for spotting duplicates and verifying integrity. For multipart uploads, the ETag follows a <hash>-<part_count> format, so you can also identify objects that were uploaded in chunks.
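The two ETag shapes can be told apart with a small helper. The sketch below is illustrative only: the function names are hypothetical, and the MD5 comparison applies only where the ETag really is an MD5 digest (single-part uploads, as described above).

```python
import hashlib
import re

# Multipart ETags look like "<32 hex chars>-<part_count>".
_MULTIPART_ETAG = re.compile(r"^([0-9a-f]{32})-(\d+)$")


def parse_etag(etag):
    """Return (hash, part_count); part_count is None for single-part uploads."""
    etag = etag.strip('"')  # S3 API responses wrap the ETag in double quotes
    match = _MULTIPART_ETAG.match(etag)
    if match:
        return match.group(1), int(match.group(2))
    return etag, None


def matches_md5(data, etag):
    """True when a single-part ETag equals the MD5 digest of data."""
    digest, parts = parse_etag(etag)
    return parts is None and hashlib.md5(data).hexdigest() == digest
```

Two objects whose single-part ETags and sizes match are strong duplicate candidates. Multipart ETags depend on the part size used during upload, so identical content can produce different multipart ETags.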

How the Scanner Maps S3 to Diskover

S3's flat namespace gets converted into a hierarchical directory structure for Diskover:

S3 Location	Diskover Index Path
Bucket `mybucket`	`/mybucket`
Object `mybucket/folder/file.txt`	`/mybucket/folder/file.txt`
Object `mybucket/deep/nested/doc.pdf`	`/mybucket/deep/nested/doc.pdf`

This means you can browse S3 buckets in Diskover's file browser just like a normal filesystem, and all standard search queries work against the indexed data.
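The mapping in the table can be expressed as a pair of small functions. This is a sketch of the idea, not the scanner's actual implementation; both function names are hypothetical.

```python
def s3_to_index_path(bucket, key=""):
    """Map a bucket name and optional object key to an absolute index path."""
    key = key.strip("/")
    return f"/{bucket}/{key}" if key else f"/{bucket}"


def parent_path(index_path):
    """Return the directory that contains index_path."""
    return index_path.rsplit("/", 1)[0] or "/"


print(s3_to_index_path("mybucket"))
print(s3_to_index_path("mybucket", "folder/file.txt"))
print(parent_path("/mybucket/deep/nested/doc.pdf"))
```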

Requirements

System Requirements

Component	Requirement
Python	3.9 or higher
Diskover	Core installation with alternate scanner support
Network	HTTPS access to AWS S3 endpoints (`*.s3.amazonaws.com` or custom endpoints)

Python Dependencies

Package	Version	Purpose
`boto3`	1.26+	AWS SDK for Python — provides the S3 client and paginator interfaces
`botocore`	1.29+	Low-level AWS client library (installed automatically with boto3)

AWS Permissions

The scanner needs read-only access to your S3 buckets. Here's the minimum IAM policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DiskoverS3ScannerReadAccess",
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListBucket",
                "s3:ListAllMyBuckets",
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": [
                "arn:aws:s3:::*",
                "arn:aws:s3:::*/*"
            ]
        }
    ]
}

Tip: To restrict access to specific buckets, replace the wildcard Resource ARNs with specific bucket ARNs, for example arn:aws:s3:::mybucket and arn:aws:s3:::mybucket/*.
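When several buckets need to be listed, generating the policy avoids hand-editing ARNs. The sketch below reuses the actions from the policy above; the function name and the bucket names passed to it are placeholders.

```python
import json

READ_ACTIONS = [
    "s3:GetBucketLocation",
    "s3:ListBucket",
    "s3:ListAllMyBuckets",
    "s3:GetObject",
    "s3:GetObjectVersion",
]


def scoped_policy(bucket_names):
    """Build a read-only policy limited to the given buckets."""
    resources = []
    for name in bucket_names:
        resources.append(f"arn:aws:s3:::{name}")    # the bucket itself
        resources.append(f"arn:aws:s3:::{name}/*")  # objects in the bucket
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DiskoverS3ScannerReadAccess",
                "Effect": "Allow",
                "Action": READ_ACTIONS,
                "Resource": resources,
            }
        ],
    }


print(json.dumps(scoped_policy(["mybucket"]), indent=4))
```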

Installation

Step 1: Install the Scanner Package

Linux:

dnf install diskover-scanner-s3

Windows:

The scanner files are included with the Diskover Windows installation. No separate installation step is required.

Install locations:

Linux: /opt/diskover/scanners/scandir_s3/
Windows: C:\Program Files\Diskover\scanners\scandir_s3\

Step 2: Install Python Dependencies

Install the AWS SDK:

python3 -m pip install boto3

Step 3: Verify Installation

Confirm boto3 is installed and accessible:

python3 -c "import boto3; print(f'boto3 version: {boto3.__version__}')"

You should see the boto3 version printed without errors.

Step 4: Configure AWS Credentials

The scanner uses the standard AWS credential chain. Choose the method that fits your environment:

Option A: Environment Variables (recommended for task-specific credentials)

export AWS_ACCESS_KEY_ID="AKIAIOSFODNN7EXAMPLE"
export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
export AWS_DEFAULT_REGION="us-east-1"

On Windows (PowerShell):

$env:AWS_ACCESS_KEY_ID = "AKIAIOSFODNN7EXAMPLE"
$env:AWS_SECRET_ACCESS_KEY = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
$env:AWS_DEFAULT_REGION = "us-east-1"

Option B: AWS Credentials File (recommended for persistent configuration)

Create or edit ~/.aws/credentials:

[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

[diskover-scanner]
aws_access_key_id = AKIAI44QH8DHBEXAMPLE
aws_secret_access_key = je7MtGbClwBF/2Zp9Utk/h3yCo8nvbEXAMPLEKEY

Create or edit ~/.aws/config:

[default]
region = us-east-1

[profile diskover-scanner]
region = us-west-2

To use a specific profile:

export AWS_PROFILE=diskover-scanner

Option C: IAM Instance Role (when running on EC2/ECS)

Attach an IAM role to your instance or task with the required permissions. No explicit credential configuration is needed — boto3 picks up the role automatically.

Option D: IAM Role Assumption (for cross-account access)

Use aws sts assume-role to obtain temporary credentials and export them as environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN).
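The same flow can be scripted with boto3 instead of the AWS CLI. This is a minimal sketch: the function names and the default session name are illustrative, and the role ARN you pass in is whatever role your administrator has set up for cross-account access.

```python
def credentials_to_env(creds):
    """Map the "Credentials" block of an STS response to boto3's env var names."""
    return {
        "AWS_ACCESS_KEY_ID": creds["AccessKeyId"],
        "AWS_SECRET_ACCESS_KEY": creds["SecretAccessKey"],
        "AWS_SESSION_TOKEN": creds["SessionToken"],
    }


def assume_role_env(role_arn, session_name="diskover-scan"):
    """Assume role_arn and return environment variables for the temporary credentials."""
    import boto3  # imported here so credentials_to_env works without boto3

    response = boto3.client("sts").assume_role(
        RoleArn=role_arn, RoleSessionName=session_name
    )
    return credentials_to_env(response["Credentials"])
```

Calling `os.environ.update(assume_role_env(role_arn))` before launching a scan from the same process would make the temporary credentials visible to boto3. They expire, so long-running or scheduled scans need to refresh them.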

Step 5: Verify AWS Connectivity

Test that the scanner can reach your S3 buckets:

python3 -c "
import boto3
s3 = boto3.client('s3')
buckets = s3.list_buckets()
print(f'Found {len(buckets[\"Buckets\"])} buckets:')
for bucket in buckets['Buckets'][:5]:
    print(f'  - {bucket[\"Name\"]}')
"

You should see a list of your accessible buckets.

Example output:
Found 17 buckets:
  - diskover-demo-bucket
  - diskover-demo-llm-sync
  - diskover-ddoe-content
  - diskover-ddoe-dataset-one
  - diskover-ddoe-dataset-two
  .......

Configuration

Configuration is managed through the Diskover Admin UI under Settings > Alternate Scanners > S3.

Configuration Parameters

Parameter	Type	Default	Description
`client_usessl`	bool	`true`	Use SSL/TLS for S3 connections. Only set to `false` for local development with non-SSL endpoints.
`client_verify`	bool	`true`	Verify SSL certificates. Set to `false` to skip verification, or provide a CA bundle path via the `S3_VERIFY` environment variable.
`max_pool_connections`	int	`25`	Maximum connections in the boto3 connection pool. Increase for high-throughput scanning of large buckets.

YAML Configuration Example

Diskover:
  Alternate Scanners:
    S3:
      Default:
        client_usessl: true
        client_verify: true
        max_pool_connections: 25

Task-Level Environment Variables

These variables are set per scan task (not in global configuration) since different tasks may target different buckets or endpoints:

Environment Variable	Description
`S3_ENDPOINT_URL`	Custom S3 endpoint URL for S3-compatible storage (e.g., MinIO, Wasabi). Overrides the default AWS endpoint. Must include the protocol (`https://` or `http://`).
`S3_BUCKET`	Default bucket name. Can be overridden by the command-line path argument.

S3-Compatible Service Endpoints

The scanner works with any S3-compatible object storage by setting the S3_ENDPOINT_URL environment variable:

Service	Endpoint Format	Notes
AWS S3	(Automatic)	No endpoint URL needed
MinIO	`https://host:port`	May require custom CA certificate
DigitalOcean Spaces	`https://region.digitaloceanspaces.com`	Region embedded in endpoint
Wasabi	`https://s3.region.wasabisys.com`	Region embedded in endpoint
Backblaze B2	`https://s3.region.backblazeb2.com`	Bucket-specific endpoint
Cloudflare R2	`https://accountid.r2.cloudflarestorage.com`	Account ID in endpoint
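Because a missing protocol is an easy mistake to make, it can help to check the endpoint value before starting a scan. The following sketch uses only the standard library; the function name is hypothetical and the check is not part of the scanner itself.

```python
from urllib.parse import urlparse


def validate_endpoint_url(url):
    """Return url without a trailing slash, or raise ValueError if it is unusable."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"endpoint must start with http:// or https://: {url!r}")
    if not parsed.hostname:
        raise ValueError(f"endpoint has no host name: {url!r}")
    return url.rstrip("/")


print(validate_endpoint_url("https://minio.example.com:9000"))
```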

Configuration Examples

Standard AWS Configuration (most common)

Leave all settings at their defaults. Credentials and region are handled via the AWS credential chain:

Diskover:
  Alternate Scanners:
    S3:
      Default:
        client_usessl: true
        client_verify: true
        max_pool_connections: 25

High-Throughput Configuration for Large Buckets

Increase the connection pool for scanning buckets with millions of objects:

Diskover:
  Alternate Scanners:
    S3:
      Default:
        client_usessl: true
        client_verify: true
        max_pool_connections: 75

S3-Compatible Storage (MinIO with Self-Signed Certificate)

Set the endpoint and credentials at the task level, then disable certificate verification in the scanner configuration:

export S3_ENDPOINT_URL="https://minio.example.com:9000"
export AWS_ACCESS_KEY_ID="minioadmin"
export AWS_SECRET_ACCESS_KEY="minioadmin123"

Diskover:
  Alternate Scanners:
    S3:
      Default:
        client_usessl: true
        client_verify: false
        max_pool_connections: 25

Note: For production S3-compatible deployments, use a proper CA bundle (S3_VERIFY=/path/to/ca-bundle.crt) instead of disabling verification entirely.
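To see how these settings relate to boto3, the sketch below translates them into `boto3.client()` arguments. How the scanner combines `client_verify` with `S3_VERIFY` internally is an assumption here, and both function names are hypothetical; the boto3 parameters themselves (`use_ssl`, `verify`, `endpoint_url`, and `Config(max_pool_connections=...)`) are standard.

```python
def client_kwargs(settings, env):
    """Translate scanner settings and task variables into boto3.client() arguments."""
    kwargs = {"use_ssl": settings.get("client_usessl", True)}
    if not settings.get("client_verify", True):
        kwargs["verify"] = False             # skip certificate verification
    elif env.get("S3_VERIFY"):
        kwargs["verify"] = env["S3_VERIFY"]  # path to a CA bundle
    if env.get("S3_ENDPOINT_URL"):
        kwargs["endpoint_url"] = env["S3_ENDPOINT_URL"]
    return kwargs


def make_client(settings, env):
    """Create an S3 client (requires boto3, imported lazily)."""
    import boto3
    from botocore.config import Config

    pool = settings.get("max_pool_connections", 25)
    return boto3.client(
        "s3",
        config=Config(max_pool_connections=pool),
        **client_kwargs(settings, env),
    )
```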

Usage

The S3 scanner uses the standard --altscanner flag with diskover.py.

Basic Commands

Scan a specific bucket:

# Linux
cd /opt/diskover
python3 diskover.py --altscanner scandir_s3 s3://mybucket

# Windows
cd "C:\Program Files\Diskover"
python diskover.py --altscanner scandir_s3 s3://mybucket

Scan all accessible buckets in the AWS account:

python3 diskover.py --altscanner scandir_s3 s3://

This automatically discovers all buckets and creates separate top paths for each one (e.g., /mybucket1, /mybucket2).

Scan a specific prefix (folder) within a bucket:

python3 diskover.py --altscanner scandir_s3 s3://mybucket/data/2024

Scan with a custom Elasticsearch index name:

python3 diskover.py -i diskover-s3-data --altscanner scandir_s3 s3://mybucket

Scan multiple specific buckets into one index:

python3 diskover.py -i diskover-s3-all --altscanner scandir_s3 s3://bucket1 s3://bucket2 s3://bucket3

Enable debug logging for troubleshooting:

python3 diskover.py --altscanner scandir_s3 --loglevel DEBUG s3://mybucket

Scan S3-compatible storage (e.g., MinIO):

export S3_ENDPOINT_URL="https://minio.example.com:9000"
python3 diskover.py --altscanner scandir_s3 s3://mybucket

Path Format Reference

Path Format	Description	Example
`s3://`	Scan all accessible buckets in the AWS account	`s3://`
`s3://<bucket>`	Scan an entire bucket	`s3://mybucket`
`s3://<bucket>/<prefix>`	Scan a specific object prefix	`s3://mybucket/data/2024`
`s3://<bucket>/<prefix>/`	Scan prefix (trailing slash is optional)	`s3://mybucket/archives/`
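The path formats in the table break down into a bucket and a prefix. The following parser is an illustrative sketch of that split, not the scanner's own code; the function name is hypothetical.

```python
def parse_s3_path(path):
    """Split an s3:// path into (bucket, prefix).

    bucket is None for "s3://" (scan all buckets); prefix is "" when absent.
    """
    scheme = "s3://"
    if not path.startswith(scheme):
        raise ValueError(f"not an s3:// path: {path!r}")
    bucket, _, prefix = path[len(scheme):].partition("/")
    # A trailing slash is optional, so drop it for a consistent result.
    return (bucket or None), prefix.rstrip("/")


for example in ("s3://", "s3://mybucket", "s3://mybucket/data/2024", "s3://mybucket/archives/"):
    print(example, "->", parse_s3_path(example))
```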

Integration with Index Tasks

When configuring the S3 scanner as part of a scheduled Index Task:

Field	Value
Alternate Scanner	`scandir_s3` or `S3`

Sample S3 storage scan task configuration:

This is a basic setup for an S3 scan. If you use an alternate credentials profile or non-AWS, S3-compatible object storage, you may need to add further configuration to the scan task, such as the AWS_PROFILE or S3_ENDPOINT_URL environment variables.

Set the scan path to your desired s3:// path in the task configuration.

Performance Tips

The max_pool_connections parameter controls boto3's connection pool size. Here's a rough sizing guide:

Scenario	Recommended Pool Size	Notes
Small buckets (<100K objects)	25 (default)	Default is sufficient
Medium buckets (100K–1M objects)	50	Reduces connection wait time
Large buckets (>1M objects)	75–100	Balance with available system resources

For best performance when scanning from AWS infrastructure, deploy Diskover in the same AWS region as your S3 buckets and use S3 VPC endpoints to avoid internet routing overhead. The scanner handles S3's 1,000-object-per-request limit automatically via the boto3 paginator, so memory usage stays constant regardless of bucket size.
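The paginator pattern mentioned above looks roughly like this in boto3. It is a sketch of the general technique, not the scanner's source; `iter_objects` is a hypothetical name, while `get_paginator("list_objects_v2")` and `paginate(Bucket=..., Prefix=...)` are the standard boto3 calls.

```python
def iter_objects(client, bucket, prefix=""):
    """Yield object entries one page at a time so only a single page is held in memory."""
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page with no matching objects has no "Contents" key.
        yield from page.get("Contents", [])


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   for obj in iter_objects(boto3.client("s3"), "mybucket", "data/2024/"):
#       print(obj["Key"], obj["Size"], obj.get("StorageClass"))
```

Because the function is a generator, a caller can process each object as it arrives instead of collecting the full listing first.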

Metadata Fields

The S3 scanner adds two custom metadata fields to every indexed object, alongside the standard file metadata (name, path, size, timestamps, etc.).

Field Mappings

Field Path	ES Type	Description
`s3_etag`	`keyword`	S3 object ETag — an MD5 hash for single-part uploads, or `<hash>-<part_count>` for multipart uploads. Useful for deduplication and change detection.
`s3_storageclass`	`keyword`	S3 storage class: `STANDARD`, `INTELLIGENT_TIERING`, `STANDARD_IA`, `ONEZONE_IA`, `GLACIER_IR`, `GLACIER`, `DEEP_ARCHIVE`, `REDUCED_REDUNDANCY`, or `OUTPOSTS`.

Elasticsearch Mapping Definition

{
  "mappings": {
    "properties": {
      "s3_etag": {
        "type": "keyword"
      },
      "s3_storageclass": {
        "type": "keyword"
      }
    }
  }
}

Example Indexed Document

Here's what a typical S3 object looks like once indexed:

{
  "name": "quarterly_report_2024_q1.xlsx",
  "path": "/mybucket/reports/finance/quarterly_report_2024_q1.xlsx",
  "path_parent": "/mybucket/reports/finance",
  "extension": "xlsx",
  "size": 1048576,
  "size_du": 1048576,
  "mtime": "2024-04-15T10:30:00Z",
  "atime": "2024-04-15T10:30:00Z",
  "ctime": "2024-04-15T10:30:00Z",
  "type": "file",
  "s3_etag": "d8e8fca2dc0f896fd7cb4cb0031ba249",
  "s3_storageclass": "STANDARD"
}
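To make the field mapping concrete, here is one way such a document could be assembled from a single entry of a `list_objects_v2` response. This is a hypothetical reconstruction based only on the example above, not the scanner's actual code: the function name is invented, and setting all three timestamps from `LastModified` simply mirrors the example, where they are identical.

```python
from datetime import datetime, timezone


def build_doc(bucket, obj):
    """Build an index document from one "Contents" entry of list_objects_v2."""
    path = f"/{bucket}/{obj['Key']}"
    parent, _, name = path.rpartition("/")
    stamp = obj["LastModified"].astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "name": name,
        "path": path,
        "path_parent": parent,
        "extension": name.rpartition(".")[2].lower() if "." in name else "",
        "size": obj["Size"],
        "size_du": obj["Size"],
        # The listing only reports LastModified, so it is reused for all three.
        "mtime": stamp,
        "atime": stamp,
        "ctime": stamp,
        "type": "file",
        "s3_etag": obj["ETag"].strip('"'),  # the API returns the ETag in quotes
        "s3_storageclass": obj.get("StorageClass", "STANDARD"),
    }
```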

Searching in Diskover

Once S3 data is indexed, you can use Diskover's standard search syntax to query S3-specific metadata fields.

Search Query Examples

By Storage Class

Query	Description
`s3_storageclass:STANDARD`	Find all objects in the STANDARD storage class
`s3_storageclass:(GLACIER OR DEEP_ARCHIVE)`	Find all archived objects
`s3_storageclass:INTELLIGENT_TIERING`	Find objects using intelligent tiering

Cost Optimization Queries

Query	Description
`s3_storageclass:STANDARD AND mtime:<now-90d`	STANDARD objects older than 90 days — candidates for STANDARD_IA
`s3_storageclass:STANDARD AND size:>1073741824`	Large STANDARD objects over 1 GB — high-cost candidates
`s3_storageclass:(STANDARD OR STANDARD_IA) AND mtime:<now-365d`	Objects older than 1 year still in expensive tiers
`s3_storageclass:STANDARD AND extension:(mp4 OR mov OR avi)`	Large video files in STANDARD — consider GLACIER for archival

ETag Queries

Query	Description
`s3_etag:d8e8fca2dc0f896fd7cb4cb0031ba249`	Find objects with a specific ETag (exact match for deduplication)
`s3_etag:*-*`	Find multipart-uploaded objects (ETag contains a dash)

Combined Queries

Query	Description
`mtime:<now-2555d`	Find objects older than 7 years (retention policy review)
`path_parent:logs AND s3_storageclass:STANDARD AND mtime:<now-30d`	Old log files still in STANDARD — good cleanup candidates
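Queries like these can also be generated programmatically, for example to feed a recurring report. The sketch below builds the query text and wraps it in an Elasticsearch `query_string` request body. It assumes the Diskover index can be queried directly with `query_string` syntax; both function names are hypothetical.

```python
import json


def stale_objects_query(storage_classes, older_than_days):
    """Build a search string for objects in the given tiers older than N days."""
    tiers = " OR ".join(storage_classes)
    return f"s3_storageclass:({tiers}) AND mtime:<now-{older_than_days}d"


def search_body(query, size=100):
    """Wrap a search string in an Elasticsearch query_string request body."""
    return {"size": size, "query": {"query_string": {"query": query}}}


query = stale_objects_query(["STANDARD", "STANDARD_IA"], 365)
print(query)
print(json.dumps(search_body(query), indent=2))
```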

Sample Diskover query output for S3 content:

The results include content spanning multiple S3 storage classes, each object with its own ETag value.

Troubleshooting

Common Issues

Issue	Cause	Solution
"Unable to locate credentials"	No AWS credentials configured	Set environment variables, configure `~/.aws/credentials`, or attach an IAM role
"InvalidAccessKeyId"	Incorrect or expired access key	Verify credentials with `aws sts get-caller-identity`
"AccessDenied" on ListBucket	Missing IAM permissions	Add the required S3 permissions (see Requirements section)
"AccessDenied" on specific bucket	Bucket policy denies access	Check bucket policy and ensure no explicit deny rules exist
SSL certificate verification error	Self-signed cert on S3-compatible endpoint	Set `S3_VERIFY=false` or provide a CA bundle path
"No module named 'boto3'"	boto3 not installed	Run `pip3 install boto3`
Bucket not found / skipped	Bucket name incorrect or wrong region	Verify bucket name (case-sensitive) and set the correct `AWS_DEFAULT_REGION`
"InvalidObjectState" for Glacier objects	Glacier/Deep Archive objects require restoration before direct read	Expected behavior — the scanner still indexes these objects with available metadata
Slow scan performance	Connection pool too small for large buckets	Increase `max_pool_connections` to 50–100
Endpoint connection refused	Missing protocol in `S3_ENDPOINT_URL`	Ensure URL includes `https://` or `http://`

Debug Logging

Enable detailed logging to diagnose issues:

# Linux
python3 /opt/diskover/diskover.py --altscanner scandir_s3 --loglevel DEBUG s3://mybucket

# Windows
cd "C:\Program Files\Diskover"
python diskover.py --altscanner scandir_s3 --loglevel DEBUG s3://mybucket

Log File Locations

Linux: /var/log/diskover/diskover.log
Windows: Check the Diskover service logs or your configured log location

Verifying Credentials

If you suspect a credentials issue, run this diagnostic:

python3 -c "
import boto3
session = boto3.Session()
credentials = session.get_credentials()
if credentials:
    print(f'Access Key: {credentials.access_key[:10]}...')
    print(f'Method: {credentials.method}')
else:
    print('No credentials found')
"

Then confirm you can reach S3:

aws sts get-caller-identity
aws s3 ls

Support

Last Updated: March 2026
Diskover Data, Inc.