GCP - Google Cloud Storage
License: PRO+ (Professional Edition or higher)
Module Type: Alternate Scanner
Author: Diskover Data, Inc.
Overview
The Diskover GCS Scanner brings your Google Cloud Storage into Diskover, letting you search, browse, and analyze objects stored across GCS buckets just like you would with on-premises files. Instead of clicking through the Google Cloud Console bucket by bucket, you get a single, unified view of all your cloud data — complete with GCS-specific metadata like storage classes, MD5 hashes, and CRC32C checksums.
Whether you're tracking down a specific file buried across multiple buckets, figuring out which storage tiers are costing you the most, or building an asset inventory for compliance, this scanner gives you the visibility you need.
Use Cases
Data Discovery and Cataloging
Organizations with data distributed across multiple GCS buckets often struggle to know what's where. The GCS scanner indexes all accessible buckets in a single operation, giving teams a unified, searchable view of cloud storage assets. You can locate specific files across buckets in seconds rather than manually browsing through the Google Cloud Console — making it easy to build comprehensive inventories for licensing, rights management, or data governance initiatives.
Cost Optimization
GCS offers multiple storage tiers with dramatically different price points, and understanding how your data is distributed across those tiers is the first step to reducing costs. By indexing storage class metadata alongside file age and size, you can quickly identify candidates for tier transitions — for example, old STANDARD-class data that should be moved to NEARLINE or COLDLINE. Track tier distribution over time to measure the impact of your optimization efforts.
Compliance and Governance
Organizations in regulated industries must maintain accurate records of data storage, demonstrate data handling practices, and respond to audit requests. The GCS scanner enables compliance workflows by indexing all objects with timestamps, MD5 hashes, and CRC32C checksums, allowing teams to verify data retention policies, detect unauthorized data placement, and accelerate audit response times from weeks to hours through automated searching and reporting.
Understanding Google Cloud Storage
Google Cloud Storage (GCS) is Google's highly durable, scalable object storage service. Like other object stores, GCS organizes data into buckets (top-level containers) that hold objects (your files). While GCS is technically a flat key-value store, it uses forward slashes (/) in object keys to simulate a folder hierarchy — and the Diskover GCS scanner translates this into the familiar directory structure you see in the Diskover Web UI.
Storage Classes
One of the most important pieces of metadata the scanner captures is the storage class, which determines both how quickly you can access an object and how much it costs to store. Here's a quick reference:
| Storage Class | Access Speed | Use Case | Relative Cost |
|---|---|---|---|
| STANDARD | Milliseconds | Frequently accessed data | Highest |
| NEARLINE | Milliseconds | Data accessed less than once a month | Lower |
| COLDLINE | Milliseconds | Data accessed less than once a quarter | Lower still |
| ARCHIVE | Milliseconds | Long-term archive, accessed less than once a year | Lowest |
| MULTI_REGIONAL | Milliseconds | Legacy class: high availability across regions | Varies |
| REGIONAL | Milliseconds | Legacy class: data stored in a single region | Varies |
| DURABLE_REDUCED_AVAILABILITY | Milliseconds | Legacy class: reduced availability | Varies |
Understanding these tiers matters because Diskover lets you search by storage class, so you can find expensive STANDARD-class objects that haven't been modified in months and may be candidates for a cheaper tier.
Note: MULTI_REGIONAL, REGIONAL, and DURABLE_REDUCED_AVAILABILITY are legacy storage classes. Google recommends using STANDARD, NEARLINE, COLDLINE, or ARCHIVE for new data. The scanner captures whichever class GCS reports for existing objects.
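As a rough illustration of the tier-transition idea, the sketch below suggests a cheaper class for a STANDARD object based on its age. The function name and the 30/90/365-day thresholds are assumptions chosen to mirror the "once a month / quarter / year" guidance in the table above; they are not Diskover or Google rules, and real decisions should also account for retrieval fees and minimum storage durations.

```python
from datetime import datetime, timezone

# Illustrative thresholds only (an assumption, not an official rule):
# minimum age in days paired with the candidate storage class.
TIER_THRESHOLDS_DAYS = [(365, "ARCHIVE"), (90, "COLDLINE"), (30, "NEARLINE")]


def suggest_storage_class(current_class: str, mtime: datetime, now: datetime) -> str:
    """Return a cheaper candidate class for a STANDARD object, else the current class."""
    if current_class != "STANDARD":
        # This sketch only considers objects that are still in STANDARD.
        return current_class
    age_days = (now - mtime).days
    for min_age, target in TIER_THRESHOLDS_DAYS:
        if age_days >= min_age:
            return target
    return current_class


now = datetime(2024, 7, 1, tzinfo=timezone.utc)
print(suggest_storage_class("STANDARD", datetime(2024, 1, 1, tzinfo=timezone.utc), now))
```

The thresholds are checked from oldest to newest so that each object gets the coldest tier it qualifies for.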
Checksums: MD5 Hashes and CRC32C
The scanner captures two types of checksums for each object:
| Checksum | Available For | Use Case |
|---|---|---|
| MD5 hash (gcs_md5hash) | Single-upload objects | Content deduplication, integrity verification, change detection |
| CRC32C (gcs_crc32c) | All objects (including composite) | Fast integrity validation |
Important: Objects created via gsutil compose or the compose API (called composite objects) do not have MD5 hashes. This is a GCS platform limitation, not a scanner issue. These objects will have an empty gcs_md5hash field but will always have a CRC32C checksum.
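To compare a local file against an indexed gcs_md5hash value, you can compute the same base64-encoded MD5 locally. This is a minimal sketch; the helper name file_md5_base64 is hypothetical, and it relies only on the fact that GCS reports MD5 hashes as base64 strings of the 16-byte digest.

```python
import base64
import hashlib
import os
import tempfile


def file_md5_base64(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the base64-encoded MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


# Demo: hash a small temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name
print(file_md5_base64(demo_path))
os.remove(demo_path)
```

If the printed value equals the object's gcs_md5hash, the content matches. Composite objects have no MD5 to compare against, so this check does not apply to them.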
How the Scanner Maps GCS to Diskover
GCS's flat namespace gets converted into a hierarchical directory structure for Diskover:
| GCS Location | Diskover Index Path |
|---|---|
| Bucket | |
| Object | |
| Object | |
This means you can browse GCS buckets in Diskover's file browser just like a normal filesystem, and all standard search queries work against the indexed data.
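The mapping can be sketched as a small function. This is an illustration under assumptions, not the scanner's actual code: the function names are hypothetical, and the path shape (/bucket/key) is taken from the example document and top-path examples elsewhere in this article.

```python
import posixpath


def gcs_to_index_path(bucket: str, object_key: str = "") -> str:
    """Map a bucket and object key to a filesystem-style index path."""
    # Empty segments (from leading, trailing, or doubled slashes) are dropped;
    # this is a simplification made for the sketch.
    parts = [bucket] + [p for p in object_key.split("/") if p]
    return "/" + "/".join(parts)


def index_parent_path(index_path: str) -> str:
    """Return the parent directory of an index path."""
    return posixpath.dirname(index_path)


path = gcs_to_index_path("mybucket", "reports/finance/quarterly_report_2024_q1.xlsx")
print(path)
print(index_parent_path(path))
```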
Requirements
System Requirements
| Component | Requirement |
|---|---|
| Python | 3.9 or higher |
| Diskover | Core installation with alternate scanner support |
| Network | HTTPS access to Google Cloud Storage endpoints (storage.googleapis.com) |
Python Dependencies
| Package | Version | Purpose |
|---|---|---|
| google-cloud-storage | 2.x | Google Cloud Storage client library for bucket and blob operations |
| google-auth | 2.x | Google authentication library (installed automatically with google-cloud-storage) |
| google-api-core | 2.x | Google API core library for exception handling (installed automatically with google-cloud-storage) |
Google Cloud IAM Requirements
The scanner requires read-only access to GCS buckets. Getting IAM permissions right is critical — this section walks through exactly what's needed.
Minimum Required IAM Roles
| IAM Role | What It Grants | Why the Scanner Needs It |
|---|---|---|
| roles/storage.objectViewer | | List and read objects within buckets |
| roles/storage.legacyBucketReader | | List all buckets in a project (required for gs:// all-bucket scanning) |
Tip: If you only plan to scan specific named buckets (e.g., gs://mybucket) rather than auto-discovering all buckets with gs://, you can skip roles/storage.legacyBucketReader and grant only roles/storage.objectViewer on individual buckets.
Granting Project-Level Access (All Buckets)
To allow the scanner to discover and scan all buckets in a GCP project:
PROJECT_ID="your-project-id"
SA_EMAIL="diskover-scanner@${PROJECT_ID}.iam.gserviceaccount.com"
# Grant object read access across all buckets
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:$SA_EMAIL" \
--role="roles/storage.objectViewer"
# Grant bucket listing access (required for gs:// all-bucket scanning)
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:$SA_EMAIL" \
--role="roles/storage.legacyBucketReader"
Granting Bucket-Level Access (Specific Buckets Only)
For tighter security, restrict access to only the buckets the scanner needs:
PROJECT_ID="your-project-id"
SA_EMAIL="diskover-scanner@${PROJECT_ID}.iam.gserviceaccount.com"
# Grant access to a specific bucket
gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
--member="serviceAccount:$SA_EMAIL" \
--role="roles/storage.objectViewer"
Note: When granting bucket-level access, you must specify each bucket individually. The scanner will skip any buckets it cannot access and log a warning.
Verifying IAM Permissions
To confirm a service account has the correct roles:
# Check project-level roles
gcloud projects get-iam-policy $PROJECT_ID \
--flatten="bindings[].members" \
--filter="bindings.members:serviceAccount:diskover-scanner@*" \
--format="table(bindings.role)"
# Check bucket-level roles
gcloud storage buckets get-iam-policy gs://BUCKET_NAME
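Custom IAM Role (Least Privilege)
If your organization requires a custom IAM role instead of predefined roles, here are the exact permissions the scanner needs. The definition below uses the role file format accepted by gcloud iam roles create with the --file flag; the title and description are placeholders:
{
  "title": "Diskover GCS Scanner",
  "description": "Read-only access for Diskover GCS scanning",
  "stage": "GA",
  "includedPermissions": [
    "storage.objects.get",
    "storage.objects.list",
    "storage.buckets.get",
    "storage.buckets.list"
  ]
}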
Important: If you're using Uniform Bucket-Level Access on your buckets (recommended by Google), ACLs are disabled and all access is controlled through IAM. If you're using Fine-Grained access control, ensure the service account is also granted read access through ACLs. When in doubt, use Uniform Bucket-Level Access — it's simpler to manage and audit.
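When auditing a service account, it can help to compare the permissions it actually holds against the four the scanner needs. The sketch below assumes you already have the granted permissions as a list of strings (how you obtain them is up to you); the function name missing_permissions is hypothetical.

```python
# Permissions listed in this article as required by the scanner.
REQUIRED_PERMISSIONS = frozenset(
    {
        "storage.objects.get",
        "storage.objects.list",
        "storage.buckets.get",
        "storage.buckets.list",
    }
)


def missing_permissions(granted):
    """Return the required permissions absent from `granted`, sorted by name."""
    return sorted(REQUIRED_PERMISSIONS - set(granted))


print(missing_permissions(["storage.objects.get", "storage.objects.list"]))
```

An empty result means every required permission is present. Per the tip earlier in this section, storage.buckets.list only matters when scanning with gs:// to auto-discover buckets.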
Installation
Step 1: Install the Scanner Package
Linux:
dnf install diskover-scanner-gcp
Windows:
The scanner files are included with the Diskover Windows installation. No separate installation step is required.
Install locations:
Linux: /opt/diskover/scanners/scandir_gcp/
Windows: C:\Program Files\Diskover\scanners\scandir_gcp\
Step 2: Install Python Dependencies
Install the Google Cloud Storage client library:
python3 -m pip install google-cloud-storage
Step 3: Verify Installation
Confirm google-cloud-storage is installed and accessible:
python3 -c "from google.cloud import storage; print('google-cloud-storage: OK')"
You should see google-cloud-storage: OK printed without errors.
Check the installed version:
python3 -m pip show google-cloud-storage | grep -E "^(Name|Version)"
Step 4: Configure GCP Credentials
The GCS scanner supports multiple authentication methods. Choose the one that best fits your environment.
Option A: Service Account JSON Key File (recommended for most deployments)
A service account JSON key file provides explicit, portable credentials that work in any environment. This is the recommended method for Diskover deployments.
- Create a service account (skip if you already have one):
gcloud iam service-accounts create diskover-scanner \
  --display-name="Diskover GCS Scanner" \
  --description="Service account for Diskover GCS scanning"
- Grant the required IAM roles (see IAM Requirements above):
PROJECT_ID=$(gcloud config get-value project)
SA_EMAIL="diskover-scanner@${PROJECT_ID}.iam.gserviceaccount.com"
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:$SA_EMAIL" \
  --role="roles/storage.objectViewer"
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:$SA_EMAIL" \
  --role="roles/storage.legacyBucketReader"
- Generate and download the JSON key file:
gcloud iam service-accounts keys create /opt/diskover/gcs-credentials.json \
  --iam-account=$SA_EMAIL
- Secure the key file:
chmod 600 /opt/diskover/gcs-credentials.json
chown diskover:diskover /opt/diskover/gcs-credentials.json
- Set the environment variable for the scanner:
export GCS_CREDENTIALS_FILE="/opt/diskover/gcs-credentials.json"
On Windows:
$env:GCS_CREDENTIALS_FILE = "C:\Program Files\Diskover\gcs-credentials.json"
Option B: Application Default Credentials (ADC)
ADC uses the standard GOOGLE_APPLICATION_CREDENTIALS environment variable or gcloud CLI credentials. This method is useful for development environments.
Using a service account key file via ADC:
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"
On Windows:
$env:GOOGLE_APPLICATION_CREDENTIALS = "C:\path\to\service-account-key.json"
Using gcloud CLI credentials (development only):
gcloud auth application-default login
Option C: GCE/GKE Metadata Server (when running on Google Cloud infrastructure)
When running on Compute Engine, GKE, or Cloud Run, the scanner automatically uses the attached service account via the metadata server. No explicit credential configuration is needed — just ensure the VM or pod service account has the required IAM roles.
Verify the metadata server is available and a service account is attached:
curl -s -H "Metadata-Flavor: Google" \
http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email
Step 5: Verify GCS Connectivity
Test that the scanner can reach your GCS buckets:
python3 -c "
from google.cloud import storage
import os
creds_file = os.getenv('GCS_CREDENTIALS_FILE', '')
if creds_file:
    client = storage.Client.from_service_account_json(creds_file)
else:
    client = storage.Client()
buckets = list(client.list_buckets())
print(f'Found {len(buckets)} buckets:')
for bucket in buckets[:5]:
    print(f'  - {bucket.name}')
"
You should see the number of accessible buckets, followed by the names of up to five of them.
Configuration
Configuration is managed through the Diskover Admin UI under Diskover > Alternate Scanners > GCS.
Configuration Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| credentials_file | string | (empty) | Path to a Google Cloud service account JSON key file. If empty, falls through to Application Default Credentials. Can be overridden by the GCS_CREDENTIALS_FILE environment variable. |
| project_id | string | (empty) | GCP project ID. Optional; inferred from credentials if not set. Can be overridden by the GCS_PROJECT_ID environment variable. |
| page_size | int | 1000 | Maximum number of results per API page when listing blobs. Increase for large buckets to reduce API call count. |
YAML Configuration Example
# Diskover Admin Configuration
Diskover:
  Alternate Scanners:
    GCS:
      Default:
        credentials_file: /opt/diskover/gcs-credentials.json
        project_id: my-gcp-project-id
        page_size: 1000
Task-Level Environment Variables
These variables are set per scan task (not in global configuration) since different tasks may target different buckets or projects:
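| Environment Variable | Description |
|---|---|
| GCS_CREDENTIALS_FILE | Path to a Google Cloud service account JSON key file. Overrides the credentials_file configuration parameter. |
| GCS_PROJECT_ID | GCP project ID. Overrides the project_id configuration parameter. |
| | Default bucket name. Can be overridden by the command-line path argument. |
| GOOGLE_APPLICATION_CREDENTIALS | Standard GCP environment variable for Application Default Credentials. Used as fallback if neither GCS_CREDENTIALS_FILE nor credentials_file is set. |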
Authentication Resolution Order
The scanner resolves credentials in the following priority order:
| Priority | Source | Description |
|---|---|---|
| 1 | GCS_CREDENTIALS_FILE | Task-level environment variable pointing to a service account JSON key file |
| 2 | credentials_file | Configuration parameter set via Admin UI or YAML |
| 3 | GOOGLE_APPLICATION_CREDENTIALS | Standard GCP ADC environment variable |
| 4 | gcloud CLI credentials | Credentials from gcloud auth application-default login |
| 5 | GCE/GKE metadata server | Automatic credentials on Google Cloud infrastructure |
Note: Credentials and project ID are typically configured via environment variables set per scan task rather than in the Admin UI, since different tasks may scan different projects or use different credentials.
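The priority order can be expressed as a short lookup. This is a sketch of the documented order, not the scanner's source code: the function name is hypothetical, and priorities 4 and 5 are represented by a single fallback because the Google authentication library handles gcloud CLI credentials and the metadata server itself once no explicit file is given.

```python
from typing import Mapping, Optional, Tuple


def resolve_credentials_source(
    env: Mapping[str, str], config_credentials_file: str = ""
) -> Tuple[str, Optional[str]]:
    """Return (source name, key file path or None) following the documented order."""
    if env.get("GCS_CREDENTIALS_FILE"):
        return ("GCS_CREDENTIALS_FILE", env["GCS_CREDENTIALS_FILE"])
    if config_credentials_file:
        return ("credentials_file", config_credentials_file)
    if env.get("GOOGLE_APPLICATION_CREDENTIALS"):
        return ("GOOGLE_APPLICATION_CREDENTIALS", env["GOOGLE_APPLICATION_CREDENTIALS"])
    # No explicit key file: defer to gcloud CLI credentials or the metadata server.
    return ("application-default", None)


print(resolve_credentials_source({"GCS_CREDENTIALS_FILE": "/opt/diskover/gcs-credentials.json"}))
```

Passing the environment as a mapping (for example os.environ) keeps the function easy to test with plain dictionaries.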
Configuration via Diskover Admin
Navigate to Diskover Admin > Diskover > Alternate Scanners > GCS
Set the credentials file path, or leave empty for ADC
Optionally set the GCP project ID
Adjust page_size for API pagination tuning
Save the configuration
Sample Configuration in Diskover Admin:
Here is the beginning of our sample configuration. There are many other configurations for the GCS Scanner, covered in detail below.
Configuration Examples
Standard Configuration — Service Account Key File
This is the recommended configuration for most deployments:
Diskover:
  Alternate Scanners:
    GCS:
      Default:
        credentials_file: /opt/diskover/gcs-credentials.json
        project_id: my-gcp-project-id
        page_size: 1000
High-Throughput Configuration for Large Buckets
Increase the page size for scanning buckets with millions of objects:
Diskover:
  Alternate Scanners:
    GCS:
      Default:
        credentials_file: /opt/diskover/gcs-credentials.json
        project_id: my-gcp-project-id
        page_size: 5000
Usage
The GCS scanner uses the standard --altscanner flag with diskover.py.
Basic Commands
Scan a specific bucket:
# Linux
cd /opt/diskover
python3 diskover.py --altscanner gcs gs://mybucket
# Windows
cd "C:\Program Files\Diskover"
python diskover.py --altscanner gcs gs://mybucket
Scan all accessible buckets in the GCP project:
python3 diskover.py --altscanner gcs gs://
This automatically discovers all buckets and creates separate top paths for each one (e.g., /mybucket1, /mybucket2).
Scan a specific prefix (folder) within a bucket:
python3 diskover.py --altscanner gcs gs://mybucket/data/2024
Scan with a custom Elasticsearch index name:
python3 diskover.py -i diskover-gcs-data --altscanner gcs gs://mybucket
Scan multiple specific buckets into one index:
python3 diskover.py -i diskover-gcs-all --altscanner gcs gs://bucket1 gs://bucket2 gs://bucket3
Enable debug logging for troubleshooting:
python3 diskover.py --altscanner gcs --loglevel DEBUG gs://mybucket
Scan with specific credentials:
export GCS_CREDENTIALS_FILE="/opt/diskover/gcs-credentials.json"
python3 diskover.py --altscanner scandir_gcp gs://mybucket
Path Format Reference
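| Path Format | Description | Example |
|---|---|---|
| gs:// | Scan all accessible buckets in the GCP project | gs:// |
| gs://bucket | Scan an entire bucket | gs://mybucket |
| gs://bucket/prefix | Scan a specific object prefix | gs://mybucket/data/2024 |
| gs://bucket/prefix/ | Scan prefix (trailing slash is optional) | gs://mybucket/data/2024/ |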
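The path formats above can be split into a bucket name and an object prefix with a few lines of Python. This is an illustrative sketch; parse_gs_path is a hypothetical helper, not part of Diskover, and it treats a trailing slash as optional in line with the table.

```python
from typing import Tuple


def parse_gs_path(path: str) -> Tuple[str, str]:
    """Split a gs:// path into (bucket, prefix); both are empty for a bare gs://."""
    scheme = "gs://"
    if not path.startswith(scheme):
        raise ValueError(f"not a gs:// path: {path!r}")
    bucket, _, prefix = path[len(scheme):].partition("/")
    # Trailing slashes on the prefix are optional, so normalise them away.
    return bucket, prefix.strip("/")


for example in ("gs://", "gs://mybucket", "gs://mybucket/data/2024/"):
    print(example, "->", parse_gs_path(example))
```

An empty bucket name corresponds to the scan-all-buckets case, and an empty prefix corresponds to scanning an entire bucket.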
Integration with Index Tasks
When configuring the GCS scanner as part of a scheduled Index Task:
| Field | Value |
|---|---|
| Alternate Scanner | |
Set the scan path to your desired gs:// path in the task configuration. Set any task-level environment variables (GCS_CREDENTIALS_FILE, GCS_PROJECT_ID) in the task environment.
Performance Tips
The page_size parameter controls the number of objects returned per API page when listing blobs. Adjust this value based on your bucket size and network conditions:
| Scenario | Recommended Page Size | Notes |
|---|---|---|
| Small buckets (<100K objects) | 1000 (default) | Default is sufficient |
| Medium buckets (100K–1M objects) | 2000 | Reduces API call count |
| Large buckets (>1M objects) | 5000 | Balance with memory and API quotas |
For best performance when scanning from Google Cloud infrastructure, deploy Diskover in the same region as your GCS buckets and use Private Google Access to avoid internet routing overhead. The scanner automatically sizes its HTTP connection pool to match Diskover's thread count, so no manual connection tuning is needed.
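To see why page size affects API call count, the arithmetic is simply the object count divided by the page size, rounded up. The sketch below is a back-of-the-envelope estimate with a hypothetical function name; it assumes every page comes back full, whereas the service may return fewer items per page than requested, so treat the result as a lower bound.

```python
import math


def estimated_list_calls(object_count: int, page_size: int = 1000) -> int:
    """Estimate how many list requests are needed to enumerate a bucket."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return math.ceil(object_count / page_size)


for size in (1000, 2000, 5000):
    print(size, estimated_list_calls(1_000_000, size))
```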
Sample CLI Scan:
Sample Index Task Scan:
Metadata Fields
The GCS scanner adds three custom metadata fields to every indexed object, alongside the standard file metadata (name, path, size, timestamps, etc.).
Field Mappings
| Field Path | ES Type | Description |
|---|---|---|
| gcs_md5hash | keyword | Base64-encoded MD5 hash of the object content. Empty for composite objects (created via compose operations). Useful for deduplication and integrity verification. |
| gcs_crc32c | keyword | Base64-encoded CRC32C checksum of the object content. Available for all objects including composite objects. Useful for fast integrity validation. |
| gcs_storageclass | keyword | GCS storage class: STANDARD, NEARLINE, COLDLINE, ARCHIVE, MULTI_REGIONAL, REGIONAL, or DURABLE_REDUCED_AVAILABILITY. |
Elasticsearch Mapping Definition
{
  "mappings": {
    "properties": {
      "gcs_md5hash": {
        "type": "keyword"
      },
      "gcs_crc32c": {
        "type": "keyword"
      },
      "gcs_storageclass": {
        "type": "keyword"
      }
    }
  }
}
Example Indexed Document
Here's what a typical GCS object looks like once indexed:
{
  "name": "quarterly_report_2024_q1.xlsx",
  "path": "/mybucket/reports/finance/quarterly_report_2024_q1.xlsx",
  "path_parent": "/mybucket/reports/finance",
  "extension": "xlsx",
  "size": 1048576,
  "size_du": 1048576,
  "mtime": "2024-04-15T10:30:00Z",
  "atime": "2024-04-15T10:30:00Z",
  "ctime": "2024-04-15T10:30:00Z",
  "type": "file",
  "gcs_md5hash": "2LMiq0clNJ6/QEG+r80EMA==",
  "gcs_crc32c": "ss6JuQ==",
  "gcs_storageclass": "STANDARD"
}
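Because gcs_md5hash is base64-encoded, it will not match the hexadecimal output of tools such as md5sum until it is converted. A minimal sketch of that conversion follows; the function name is hypothetical.

```python
import base64


def md5_base64_to_hex(value: str) -> str:
    """Convert a base64-encoded MD5 digest to its 32-character hex form."""
    return base64.b64decode(value).hex()


# Value taken from the example document above.
print(md5_base64_to_hex("2LMiq0clNJ6/QEG+r80EMA=="))
```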
Searching in Diskover
Once GCS data is indexed, you can use Diskover's standard search syntax to query GCS-specific metadata fields.
Search Query Examples
By Storage Class
| Query | Description |
|---|---|
| | Find all objects in the STANDARD storage class |
| | Find all objects in NEARLINE |
| | Find all archived objects |
| | Find objects in legacy storage classes |
Cost Optimization Queries
| Query | Description |
|---|---|
| | STANDARD objects older than 90 days: candidates for NEARLINE |
| | Large STANDARD objects over 1 GB: high-cost candidates |
| | Objects older than 1 year still in expensive tiers |
| | Video files in STANDARD: consider COLDLINE or ARCHIVE |
| | Objects older than 7 years (retention policy review) |
Checksum Queries
| Query | Description |
|---|---|
| | Find objects with a specific MD5 hash (exact match for deduplication) |
| | Find composite objects missing MD5 hash |
| | Find all objects with CRC32C checksums (should be all objects) |
Combined Queries
| Query | Description |
|---|---|
| | Old log files still in STANDARD: good cleanup candidates |
| | Potentially sensitive files in general-access locations |
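As a rough guide to composing queries like those described above, the sketch below builds query strings from field names used in this article. It assumes Diskover's search accepts Elasticsearch query-string syntax (field:value clauses joined with AND, and date math such as now-90d); that syntax and the helper names are assumptions, so verify the resulting strings in your own Diskover search bar.

```python
def term(field: str, value: str) -> str:
    """Format one field:value clause, quoting values that contain spaces."""
    return f'{field}:"{value}"' if " " in value else f"{field}:{value}"


def older_than(days: int, field: str = "mtime") -> str:
    """Format a range clause matching timestamps more than `days` days old."""
    return f"{field}:[* TO now-{days}d]"


def all_of(*clauses: str) -> str:
    """Join clauses so that every one of them must match."""
    return " AND ".join(clauses)


print(all_of(term("gcs_storageclass", "STANDARD"), older_than(90)))
print(all_of(term("gcs_storageclass", "STANDARD"), term("extension", "log"), older_than(365)))
```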
Sample Metadata from GCS Scanner Output:
Troubleshooting
Common Issues
| Issue | Cause | Solution |
|---|---|---|
| "Could not automatically determine credentials" | No GCP credentials configured | Set GCS_CREDENTIALS_FILE or GOOGLE_APPLICATION_CREDENTIALS, or configure credentials_file |
| "403 Forbidden" or "Caller does not have storage.objects.list" | Missing IAM permissions | Grant roles/storage.objectViewer to the service account |
| Bucket not found / skipped during scan | Bucket name incorrect or wrong project | Verify bucket name (case-sensitive, globally unique) and set the correct GCS_PROJECT_ID |
| "No module named 'google.cloud'" | google-cloud-storage not installed | Run python3 -m pip install google-cloud-storage |
| "ValueError: Service account info was not in the expected format" | Corrupted or invalid JSON key file | Regenerate the key file via gcloud iam service-accounts keys create |
| No buckets discovered with gs:// | Missing storage.buckets.list permission | Grant roles/storage.legacyBucketReader |
| Wrong project buckets listed | Incorrect project configured | Set GCS_PROJECT_ID to the intended project |
| Empty gcs_md5hash | Composite objects don't have MD5 hashes in GCS | Expected behavior; use gcs_crc32c |
| Slow scan performance | Low page_size | Increase page_size |
| Connection timeout or DNS resolution errors | Network connectivity issues to storage.googleapis.com | Verify firewall allows outbound HTTPS (port 443) to storage.googleapis.com |
| "Connection pool is full" warnings | Many concurrent threads | The scanner automatically sizes the connection pool to match Diskover's thread count, so this warning should not occur under normal conditions |
Debug Logging
Enable detailed logging to diagnose issues:
# Linux
python3 /opt/diskover/diskover.py --altscanner scandir_gcp --loglevel DEBUG gs://mybucket
# Windows
cd "C:\Program Files\Diskover"
python diskover.py --altscanner scandir_gcp --loglevel DEBUG gs://mybucket
Log File Locations
Linux: /var/log/diskover/diskover.log
Windows: Check the Diskover service logs or your configured log location
Diagnosing Authentication Issues
Step 1 — Check credentials configuration:
echo "GCS_CREDENTIALS_FILE: $GCS_CREDENTIALS_FILE"
echo "GOOGLE_APPLICATION_CREDENTIALS: $GOOGLE_APPLICATION_CREDENTIALS"
On Windows:
Write-Output "GCS_CREDENTIALS_FILE: $env:GCS_CREDENTIALS_FILE"
Write-Output "GOOGLE_APPLICATION_CREDENTIALS: $env:GOOGLE_APPLICATION_CREDENTIALS"
Step 2 — Verify the credentials file is valid JSON:
python3 -c "
import json, os
f = os.getenv('GCS_CREDENTIALS_FILE', '')
# Stop early with a clear message instead of failing on open('')
if not f:
    raise SystemExit('GCS_CREDENTIALS_FILE is not set')
with open(f) as fh:
    data = json.load(fh)
required = ['type', 'project_id', 'private_key_id', 'private_key', 'client_email']
for field in required:
    print(f'{field}: {\"present\" if field in data else \"MISSING\"}')
print(f'Type: {data.get(\"type\", \"unknown\")}')
"
Step 3 — Test authentication end-to-end:
python3 -c "
from google.cloud import storage
import os
creds_file = os.getenv('GCS_CREDENTIALS_FILE', '')
try:
    if creds_file:
        client = storage.Client.from_service_account_json(creds_file)
    else:
        client = storage.Client()
    buckets = list(client.list_buckets())
    print(f'Authentication successful - found {len(buckets)} buckets')
except Exception as e:
    print(f'Authentication failed: {e}')
"
Diagnosing Network Connectivity
Test HTTPS connectivity to Google Cloud Storage:
curl -I "https://storage.googleapis.com/"
Check DNS resolution:
nslookup storage.googleapis.com
If using a proxy, ensure it's configured:
export HTTPS_PROXY="http://proxy.example.com:8080"
Support
Last Updated: April 2026
Diskover Data, Inc.