AWS S3 & S3 Compatible
License: PRO (Professional Edition or higher)
Module Type: Alternate Scanner
Author: Diskover Data, Inc.
Overview
The Diskover S3 Scanner brings your Amazon S3 cloud storage into Diskover, letting you search, browse, and analyze objects stored across S3 buckets just like you would with on-premises files. Instead of logging into the AWS console and clicking through bucket after bucket, you get a single, unified view of all your cloud data — complete with S3-specific metadata like storage classes and ETags.
Whether you're tracking down a specific file buried in a multi-terabyte data lake, figuring out which buckets are costing you the most, or building an inventory of assets across your entire AWS account, this scanner gives you the visibility you need.
The scanner also works with S3-compatible object stores like MinIO, Wasabi, DigitalOcean Spaces, Backblaze B2, and Cloudflare R2 — so you're not limited to AWS.
Use Cases
Data Discovery and Cataloging
Organizations with data distributed across multiple S3 buckets often struggle to know what's where. The S3 scanner indexes all accessible buckets in a single operation, giving teams a unified, searchable view of cloud storage assets. You can locate specific files across buckets in seconds rather than manually browsing through the S3 console — making it easy to build comprehensive inventories for licensing, rights management, or data governance initiatives.
Cost Optimization
S3 offers multiple storage tiers with dramatically different price points, and understanding how your data is distributed across those tiers is the first step to reducing costs. By indexing storage class metadata alongside file age and size, you can quickly identify candidates for tier transitions — for example, old STANDARD-class data that should be moved to STANDARD_IA or GLACIER. Track tier distribution over time to measure the impact of your optimization efforts.
Figure: Sample storage tier analysis built from S3 scanner metadata, showing the number of files in each tier and how much storage each tier consumes.
Understanding Amazon S3
Amazon S3 (Simple Storage Service) organizes data into buckets, which are top-level containers that hold objects (your files). While S3 is technically a flat key-value store, it uses forward slashes (/) in object keys to simulate a folder hierarchy — and the Diskover S3 scanner translates this into the familiar directory structure you see in the Diskover Web UI.
Storage Classes
One of the most important pieces of metadata the scanner captures is the storage class, which determines both how quickly you can access an object and how much it costs to store. Here's a quick reference:
| Storage Class | Access Speed | Use Case | Relative Cost |
|---|---|---|---|
| | Milliseconds | Frequently accessed data | Highest |
| | Milliseconds | Unpredictable access patterns | Auto-optimized |
| | Milliseconds | Infrequent access, rapid retrieval needed | Lower |
| | Milliseconds | Infrequent access, single-AZ durability acceptable | Lower still |
| | Milliseconds | Archive with instant retrieval | Archive tier |
| | Minutes to hours | Long-term archive | Low |
| | Up to 12 hours | Rarely accessed, lowest cost | Lowest |
Understanding these tiers matters because Diskover lets you search by storage class — so you can find expensive STANDARD-class objects that haven't been accessed in months and should be transitioned to a cheaper tier.
ETags
The scanner also captures ETags (entity tags), which are identifiers assigned to each object. For objects uploaded in a single operation, the ETag is typically the MD5 hash of the content — useful for spotting duplicates and verifying integrity. For multipart uploads, the ETag follows a <hash>-<part_count> format, so you can also identify objects that were uploaded in chunks.
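The two ETag shapes described above can be told apart with a small parser. This is a minimal sketch, not the scanner's own code: the function name `parse_etag` is hypothetical, and it assumes the hash portion is 32 hexadecimal characters, optionally wrapped in the double quotes that the S3 API returns around ETags.

```python
import re

# Single-part ETags: 32 hex characters. Multipart ETags: "<hash>-<part_count>".
# Surrounding double quotes are tolerated because the S3 API returns them.
_ETAG_RE = re.compile(r'^"?([0-9a-fA-F]{32})(?:-(\d+))?"?$')


def parse_etag(etag):
    """Return (hash_hex, part_count); part_count is None for single-part uploads."""
    match = _ETAG_RE.match(etag.strip())
    if match is None:
        raise ValueError(f"unrecognized ETag format: {etag!r}")
    digest, parts = match.groups()
    return digest.lower(), (int(parts) if parts else None)
```

A `part_count` of `None` suggests a single-operation upload, where the hash is typically the MD5 of the content; any integer value marks a multipart upload, whose hash is not a plain MD5 of the whole object.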
How the Scanner Maps S3 to Diskover
S3's flat namespace gets converted into a hierarchical directory structure for Diskover:
| S3 Location | Diskover Index Path |
|---|---|
| Bucket | |
| Object | |
| Object | |
This means you can browse S3 buckets in Diskover's file browser just like a normal filesystem, and all standard search queries work against the indexed data.
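To make the flat-to-hierarchical mapping concrete, here is a minimal sketch of how a bucket name and object key could be turned into a slash-rooted index path. The function names are hypothetical and this is not the scanner's actual implementation; it only assumes the `/bucket/key` layout visible in the example paths elsewhere in this article.

```python
import posixpath


def s3_to_index_path(bucket, key=""):
    """Map a bucket and object key to a slash-rooted index path."""
    key = key.strip("/")
    if not key:
        return "/" + bucket  # the bucket itself becomes the top path
    return posixpath.join("/", bucket, key)


def split_index_path(path):
    """Return (path_parent, name) for an index path."""
    return posixpath.split(path)
```

For example, bucket `mybucket` with key `reports/finance/quarterly_report_2024_q1.xlsx` maps to `/mybucket/reports/finance/quarterly_report_2024_q1.xlsx`, whose parent is `/mybucket/reports/finance`.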
Requirements
System Requirements
| Component | Requirement |
|---|---|
| Python | 3.9 or higher |
| Diskover | Core installation with alternate scanner support |
| Network | HTTPS access to AWS S3 endpoints ( |
Python Dependencies
| Package | Version | Purpose |
|---|---|---|
| boto3 | 1.26+ | AWS SDK for Python: provides the S3 client and paginator interfaces |
| botocore | 1.29+ | Low-level AWS client library (installed automatically with boto3) |
AWS Permissions
The scanner needs read-only access to your S3 buckets. Here's the minimum IAM policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DiskoverS3ScannerReadAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket",
        "s3:ListAllMyBuckets",
        "s3:GetObject",
        "s3:GetObjectVersion"
      ],
      "Resource": [
        "arn:aws:s3:::*",
        "arn:aws:s3:::*/*"
      ]
    }
  ]
}
Tip: To restrict access to specific buckets, replace the wildcard Resource ARNs with specific bucket ARNs, for example arn:aws:s3:::mybucket and arn:aws:s3:::mybucket/*.
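If you maintain several bucket-scoped policies, generating them can be less error-prone than editing JSON by hand. The sketch below builds the same statement as the policy above with the Resource list narrowed to named buckets; `build_read_policy` is a hypothetical helper, not part of Diskover or the AWS SDK.

```python
import json

# Actions copied from the minimum IAM policy shown above.
READ_ACTIONS = [
    "s3:GetBucketLocation",
    "s3:ListBucket",
    "s3:ListAllMyBuckets",
    "s3:GetObject",
    "s3:GetObjectVersion",
]


def build_read_policy(bucket_names):
    """Return a read-only policy document limited to the given buckets."""
    resources = []
    for name in bucket_names:
        resources.append(f"arn:aws:s3:::{name}")    # bucket-level actions
        resources.append(f"arn:aws:s3:::{name}/*")  # object-level actions
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DiskoverS3ScannerReadAccess",
                "Effect": "Allow",
                "Action": READ_ACTIONS,
                "Resource": resources,
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_read_policy(["mybucket"]), indent=2))
```

Because s3:ListAllMyBuckets is not tied to a single bucket, a policy scoped this way may not permit the all-buckets scan (s3://); test it against your own account before relying on it.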
Installation
Step 1: Install the Scanner Package
Linux:
dnf install diskover-scanner-s3
Windows:
The scanner files are included with the Diskover Windows installation. No separate installation step is required.
Install locations:
Linux: /opt/diskover/scanners/scandir_s3/
Windows: C:\Program Files\Diskover\scanners\scandir_s3\
Step 2: Install Python Dependencies
Install the AWS SDK:
python3 -m pip install boto3
Step 3: Verify Installation
Confirm boto3 is installed and accessible:
python3 -c "import boto3; print(f'boto3 version: {boto3.__version__}')"
You should see the boto3 version printed without errors.
Step 4: Configure AWS Credentials
The scanner uses the standard AWS credential chain. Choose the method that fits your environment:
Option A: Environment Variables (recommended for task-specific credentials)
export AWS_ACCESS_KEY_ID="AKIAIOSFODNN7EXAMPLE"
export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
export AWS_DEFAULT_REGION="us-east-1"
On Windows:
$env:AWS_ACCESS_KEY_ID = "AKIAIOSFODNN7EXAMPLE"
$env:AWS_SECRET_ACCESS_KEY = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
$env:AWS_DEFAULT_REGION = "us-east-1"
Option B: AWS Credentials File (recommended for persistent configuration)
Create or edit ~/.aws/credentials:
[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

[diskover-scanner]
aws_access_key_id = AKIAI44QH8DHBEXAMPLE
aws_secret_access_key = je7MtGbClwBF/2Zp9Utk/h3yCo8nvbEXAMPLEKEY
Create or edit ~/.aws/config:
[default]
region = us-east-1

[profile diskover-scanner]
region = us-west-2
To use a specific profile:
export AWS_PROFILE=diskover-scanner
Option C: IAM Instance Role (when running on EC2/ECS)
Attach an IAM role to your instance or task with the required permissions. No explicit credential configuration is needed — boto3 picks up the role automatically.
Option D: IAM Role Assumption (for cross-account access)
Use aws sts assume-role to obtain temporary credentials and export them as environment variables.
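The same role assumption can be done from Python with boto3's STS client. This is a sketch under stated assumptions: the helper names and the default session name `diskover-s3-scan` are hypothetical, and the role ARN is whatever your cross-account setup provides. boto3 is imported lazily so the mapping helper can be used without it.

```python
def credentials_to_env(response):
    """Map an STS AssumeRole response to the standard AWS environment variables."""
    creds = response["Credentials"]
    return {
        "AWS_ACCESS_KEY_ID": creds["AccessKeyId"],
        "AWS_SECRET_ACCESS_KEY": creds["SecretAccessKey"],
        "AWS_SESSION_TOKEN": creds["SessionToken"],
    }


def assume_role_env(role_arn, session_name="diskover-s3-scan"):
    """Assume the role and return temporary credentials as an env-var dict."""
    import boto3  # imported here so credentials_to_env works without boto3

    sts = boto3.client("sts")
    response = sts.assume_role(RoleArn=role_arn, RoleSessionName=session_name)
    return credentials_to_env(response)
```

The returned dictionary can be merged into the environment of the scan process. The credentials are temporary, so long-running or scheduled scans need to refresh them before they expire.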
Step 5: Verify AWS Connectivity
Test that the scanner can reach your S3 buckets:
python3 -c "
import boto3
s3 = boto3.client('s3')
buckets = s3.list_buckets()
print(f'Found {len(buckets[\"Buckets\"])} buckets:')
for bucket in buckets['Buckets'][:5]:
    print(f' - {bucket[\"Name\"]}')
"
You should see a list of your accessible buckets.
Example output:
Found 17 buckets:
- diskover-demo-bucket
- diskover-demo-llm-sync
- diskover-ddoe-content
- diskover-ddoe-dataset-one
- diskover-ddoe-dataset-two
.......
Configuration
Configuration is managed through the Diskover Admin UI under Settings > Alternate Scanners > S3.
Configuration Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| client_usessl | bool | true | Use SSL/TLS for S3 connections. Only set to |
| client_verify | bool | true | Verify SSL certificates. Set to |
| max_pool_connections | int | 25 | Maximum connections in the boto3 connection pool. Increase for high-throughput scanning of large buckets. |
YAML Configuration Example
Diskover:
  Alternate Scanners:
    S3:
      Default:
        client_usessl: true
        client_verify: true
        max_pool_connections: 25
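For orientation, these three settings correspond to options that boto3 itself exposes: `use_ssl` and `verify` on `boto3.client`, and `max_pool_connections` on `botocore.config.Config`. The sketch below shows one plausible way to pass them through. How the scanner wires them internally is an assumption here, and both function names are hypothetical.

```python
def client_kwargs(settings, endpoint_url=None):
    """Translate the scanner settings into keyword arguments for boto3.client."""
    kwargs = {
        "use_ssl": settings.get("client_usessl", True),
        "verify": settings.get("client_verify", True),
    }
    if endpoint_url:
        kwargs["endpoint_url"] = endpoint_url  # only for S3-compatible stores
    return kwargs


def make_s3_client(settings, endpoint_url=None):
    """Create an S3 client with the configured connection pool size."""
    import boto3  # imported lazily; requires the boto3 package
    from botocore.config import Config

    config = Config(max_pool_connections=settings.get("max_pool_connections", 25))
    return boto3.client("s3", config=config, **client_kwargs(settings, endpoint_url))
```

Calling `make_s3_client({"client_usessl": True, "client_verify": True, "max_pool_connections": 25})` would mirror the YAML example above.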
Task-Level Environment Variables
These variables are set per scan task (not in global configuration) since different tasks may target different buckets or endpoints:
| Environment Variable | Description |
|---|---|
| S3_ENDPOINT_URL | Custom S3 endpoint URL for S3-compatible storage (e.g., MinIO, Wasabi). Overrides the default AWS endpoint. Must include the protocol ( |
| | Default bucket name. Can be overridden by the command-line path argument. |
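Since a missing protocol in the endpoint URL is an easy mistake to make, a quick pre-flight check can catch it before a scan starts. This is a minimal sketch with a hypothetical function name; it assumes the accepted protocols are `http://` and `https://`.

```python
from urllib.parse import urlparse


def check_endpoint_url(url):
    """Return the URL without a trailing slash, or raise if the protocol or host is missing."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"endpoint URL must start with http:// or https://: {url!r}")
    if not parsed.netloc:
        raise ValueError(f"endpoint URL has no host: {url!r}")
    return url.rstrip("/")
```

For instance, `check_endpoint_url("https://minio.example.com:9000")` passes, while a bare `minio.example.com:9000` raises `ValueError`.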
S3-Compatible Service Endpoints
The scanner works with any S3-compatible object storage by setting the S3_ENDPOINT_URL environment variable:
| Service | Endpoint Format | Notes |
|---|---|---|
| AWS S3 | (Automatic) | No endpoint URL needed |
| MinIO | | May require custom CA certificate |
| DigitalOcean Spaces | | Region embedded in endpoint |
| Wasabi | | Region embedded in endpoint |
| Backblaze B2 | | Bucket-specific endpoint |
| Cloudflare R2 | | Account ID in endpoint |
Configuration Examples
Standard AWS Configuration (most common)
Leave all settings at their defaults. Credentials and region are handled via the AWS credential chain:
Diskover:
  Alternate Scanners:
    S3:
      Default:
        client_usessl: true
        client_verify: true
        max_pool_connections: 25
High-Throughput Configuration for Large Buckets
Increase the connection pool for scanning buckets with millions of objects:
Diskover:
  Alternate Scanners:
    S3:
      Default:
        client_usessl: true
        client_verify: true
        max_pool_connections: 75
S3-Compatible Storage (MinIO with Self-Signed Certificate)
Set the endpoint and disable SSL verification at the task level:
export S3_ENDPOINT_URL="https://minio.example.com:9000"
export AWS_ACCESS_KEY_ID="minioadmin"
export AWS_SECRET_ACCESS_KEY="minioadmin123"
Diskover:
  Alternate Scanners:
    S3:
      Default:
        client_usessl: true
        client_verify: false
        max_pool_connections: 25
Note: For production S3-compatible deployments, use a proper CA bundle (S3_VERIFY=/path/to/ca-bundle.crt) instead of disabling verification entirely.
Usage
The S3 scanner uses the standard --altscanner flag with diskover.py.
Basic Commands
Scan a specific bucket:
# Linux
cd /opt/diskover
python3 diskover.py --altscanner scandir_s3 s3://mybucket

# Windows
cd "C:\Program Files\Diskover"
python diskover.py --altscanner scandir_s3 s3://mybucket
Scan all accessible buckets in the AWS account:
python3 diskover.py --altscanner scandir_s3 s3://
This automatically discovers all buckets and creates separate top paths for each one (e.g., /mybucket1, /mybucket2).
Scan a specific prefix (folder) within a bucket:
python3 diskover.py --altscanner scandir_s3 s3://mybucket/data/2024
Scan with a custom Elasticsearch index name:
python3 diskover.py -i diskover-s3-data --altscanner scandir_s3 s3://mybucket
Scan multiple specific buckets into one index:
python3 diskover.py -i diskover-s3-all --altscanner scandir_s3 s3://bucket1 s3://bucket2 s3://bucket3
Enable debug logging for troubleshooting:
python3 diskover.py --altscanner scandir_s3 --loglevel DEBUG s3://mybucket
Scan S3-compatible storage (e.g., MinIO):
export S3_ENDPOINT_URL="https://minio.example.com:9000"
python3 diskover.py --altscanner scandir_s3 s3://mybucket
Path Format Reference
| Path Format | Description | Example |
|---|---|---|
| | Scan all accessible buckets in the AWS account | |
| | Scan an entire bucket | |
| | Scan a specific object prefix | |
| | Scan prefix (trailing slash is optional) | |
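The path forms used in the commands above (`s3://`, `s3://mybucket`, `s3://mybucket/data/2024`) can be broken into a bucket and a prefix as follows. This is an illustrative sketch with a hypothetical function name, not the scanner's own parser.

```python
def parse_s3_path(path):
    """Split an s3:// path into (bucket, prefix).

    bucket is None for "s3://" (all accessible buckets); prefix is "" when
    the whole bucket is meant. A trailing slash on the prefix is ignored.
    """
    scheme = "s3://"
    if not path.startswith(scheme):
        raise ValueError(f"not an s3:// path: {path!r}")
    bucket, _, prefix = path[len(scheme):].partition("/")
    return (bucket or None), prefix.rstrip("/")
```

With this helper, `s3://mybucket/data/2024` and `s3://mybucket/data/2024/` both resolve to bucket `mybucket` and prefix `data/2024`.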
Integration with Index Tasks
When configuring the S3 scanner as part of a scheduled Index Task:
| Field | Value |
|---|---|
| Alternate Scanner | |
Figure: Sample S3 storage scan task configuration, showing a basic setup for an S3 scan. If you use an alternate credentials profile or a non-AWS S3-compatible object store, you may need to include additional configuration in your Diskover scan task.
Set the scan path to your desired s3:// path in the task configuration.
Performance Tips
The max_pool_connections parameter controls boto3's connection pool size. Here's a rough sizing guide:
| Scenario | Recommended Pool Size | Notes |
|---|---|---|
| Small buckets (<100K objects) | 25 (default) | Default is sufficient |
| Medium buckets (100K–1M objects) | 50 | Reduces connection wait time |
| Large buckets (>1M objects) | 75–100 | Balance with available system resources |
For best performance when scanning from AWS infrastructure, deploy Diskover in the same AWS region as your S3 buckets and use S3 VPC endpoints to avoid internet routing overhead. The scanner handles S3's 1,000-object-per-request limit automatically via the boto3 paginator, so memory usage stays constant regardless of bucket size.
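The paginator pattern mentioned above can be sketched as follows: each page holds at most 1,000 objects and is processed and discarded before the next is fetched, so only the running totals stay in memory. The function names are hypothetical, and treating a missing `StorageClass` as `STANDARD` is an assumption made for this example.

```python
from collections import defaultdict


def summarize_pages(pages):
    """Tally object count and total bytes per storage class across listing pages."""
    totals = defaultdict(lambda: {"objects": 0, "bytes": 0})
    for page in pages:
        # An empty page has no "Contents" key.
        for obj in page.get("Contents", []):
            storage_class = obj.get("StorageClass", "STANDARD")  # assumed default
            totals[storage_class]["objects"] += 1
            totals[storage_class]["bytes"] += obj["Size"]
    return dict(totals)


def summarize_bucket(bucket, prefix=""):
    """Stream a bucket listing through the boto3 paginator and summarize it."""
    import boto3  # imported lazily; requires the boto3 package

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    return summarize_pages(paginator.paginate(Bucket=bucket, Prefix=prefix))
```

`summarize_pages` accepts any iterable of `list_objects_v2` response pages, which makes it easy to try out with hand-written sample data before pointing `summarize_bucket` at a real bucket.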
Metadata Fields
The S3 scanner adds two custom metadata fields to every indexed object, alongside the standard file metadata (name, path, size, timestamps, etc.).
Field Mappings
| Field Path | ES Type | Description |
|---|---|---|
| s3_etag | keyword | S3 object ETag: an MD5 hash for single-part uploads, or |
| s3_storageclass | keyword | S3 storage class: |
Elasticsearch Mapping Definition
{
  "mappings": {
    "properties": {
      "s3_etag": {
        "type": "keyword"
      },
      "s3_storageclass": {
        "type": "keyword"
      }
    }
  }
}
Example Indexed Document
Here's what a typical S3 object looks like once indexed:
{
  "name": "quarterly_report_2024_q1.xlsx",
  "path": "/mybucket/reports/finance/quarterly_report_2024_q1.xlsx",
  "path_parent": "/mybucket/reports/finance",
  "extension": "xlsx",
  "size": 1048576,
  "size_du": 1048576,
  "mtime": "2024-04-15T10:30:00Z",
  "atime": "2024-04-15T10:30:00Z",
  "ctime": "2024-04-15T10:30:00Z",
  "type": "file",
  "s3_etag": "d8e8fca2dc0f896fd7cb4cb0031ba249",
  "s3_storageclass": "STANDARD"
}
Searching in Diskover
Once S3 data is indexed, you can use Diskover's standard search syntax to query S3-specific metadata fields.
Search Query Examples
By Storage Class
| Query | Description |
|---|---|
| | Find all objects in the STANDARD storage class |
| | Find all archived objects |
| | Find objects using intelligent tiering |
Cost Optimization Queries
| Query | Description |
|---|---|
| | STANDARD objects older than 90 days (candidates for STANDARD_IA) |
| | Large STANDARD objects over 1 GB (high-cost candidates) |
| | Objects older than 1 year still in expensive tiers |
| | Large video files in STANDARD (consider GLACIER for archival) |
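The logic behind the first cost query can be expressed in a few lines of Python over exported index documents. This sketch assumes documents shaped like the example indexed document earlier in this article, and it uses `mtime` as the measure of age, which is an assumption. The function name `tier_candidates` is hypothetical, and this is not Diskover query syntax.

```python
from datetime import datetime, timedelta, timezone


def tier_candidates(docs, now, min_age_days=90, storage_class="STANDARD"):
    """Return docs in the given storage class whose mtime is older than the cutoff."""
    cutoff = now - timedelta(days=min_age_days)
    matches = []
    for doc in docs:
        # Timestamps look like "2024-04-15T10:30:00Z" (UTC).
        mtime = datetime.strptime(doc["mtime"], "%Y-%m-%dT%H:%M:%SZ")
        mtime = mtime.replace(tzinfo=timezone.utc)
        if doc["s3_storageclass"] == storage_class and mtime < cutoff:
            matches.append(doc)
    return matches
```

Pass a timezone-aware `now`, for example `datetime.now(timezone.utc)`, so the comparison against the UTC timestamps is valid.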
ETag Queries
| Query | Description |
|---|---|
| | Find objects with a specific ETag (exact match for deduplication) |
| | Find multipart-uploaded objects (ETag contains a dash) |
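For deduplication across many objects at once, exported index documents can be grouped by ETag. The sketch below uses a hypothetical function name and assumes documents carry the `s3_etag`, `size`, and `path` fields shown in the example indexed document. Size is added to the grouping key as a cheap extra guard.

```python
from collections import defaultdict


def duplicate_groups(docs):
    """Group paths by (ETag, size) and keep only groups with more than one path."""
    by_key = defaultdict(list)
    for doc in docs:
        by_key[(doc["s3_etag"], doc["size"])].append(doc["path"])
    return {key: paths for key, paths in by_key.items() if len(paths) > 1}
```

Matching ETags are a strong hint rather than proof of identical content. Multipart ETags also depend on how the upload was split into parts, so identical files uploaded with different part sizes will not match.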
Combined Queries
| Query | Description |
|---|---|
| | Find objects older than 7 years (retention policy review) |
| | Old log files still in STANDARD (good cleanup candidates) |
Figure: Sample Diskover query output for S3 content, showing objects that span multiple S3 storage classes, each with its own ETag value.
Troubleshooting
Common Issues
| Issue | Cause | Solution |
|---|---|---|
| "Unable to locate credentials" | No AWS credentials configured | Set environment variables, configure |
| "InvalidAccessKeyId" | Incorrect or expired access key | Verify credentials with |
| "AccessDenied" on ListBucket | Missing IAM permissions | Add the required S3 permissions (see Requirements section) |
| "AccessDenied" on specific bucket | Bucket policy denies access | Check bucket policy and ensure no explicit deny rules exist |
| SSL certificate verification error | Self-signed cert on S3-compatible endpoint | Set |
| "No module named 'boto3'" | boto3 not installed | Run |
| Bucket not found / skipped | Bucket name incorrect or wrong region | Verify bucket name (case-sensitive) and set the correct |
| "InvalidObjectState" for Glacier objects | Glacier/Deep Archive objects require restoration before direct read | Expected behavior; the scanner still indexes these objects with available metadata |
| Slow scan performance | Connection pool too small for large buckets | Increase |
| Endpoint connection refused | Missing protocol in | Ensure URL includes |
Debug Logging
Enable detailed logging to diagnose issues:
# Linux
python3 /opt/diskover/diskover.py --altscanner scandir_s3 --loglevel DEBUG s3://mybucket

# Windows
cd "C:\Program Files\Diskover"
python diskover.py --altscanner scandir_s3 --loglevel DEBUG s3://mybucket
Log File Locations
Linux: /var/log/diskover/diskover.log
Windows: Check the Diskover service logs or your configured log location
Verifying Credentials
If you suspect a credentials issue, run this diagnostic:
python3 -c "
import boto3
session = boto3.Session()
credentials = session.get_credentials()
if credentials:
    print(f'Access Key: {credentials.access_key[:10]}...')
    print(f'Method: {credentials.method}')
else:
    print('No credentials found')
"
Then confirm you can reach S3:
aws sts get-caller-identity
aws s3 ls
Support
Last Updated: March 2026
Diskover Data, Inc.