GCP - Google Cloud Storage

License: PRO+ (Professional Edition or higher)
Module Type: Alternate Scanner
Author: Diskover Data, Inc.

Overview

The Diskover GCS Scanner brings your Google Cloud Storage into Diskover, letting you search, browse, and analyze objects stored across GCS buckets just like you would with on-premises files. Instead of clicking through the Google Cloud Console bucket by bucket, you get a single, unified view of all your cloud data — complete with GCS-specific metadata like storage classes, MD5 hashes, and CRC32C checksums.

Whether you're tracking down a specific file buried across multiple buckets, figuring out which storage tiers are costing you the most, or building an asset inventory for compliance, this scanner gives you the visibility you need.

Use Cases

Data Discovery and Cataloging
Organizations with data distributed across multiple GCS buckets often struggle to know what's where. The GCS scanner indexes all accessible buckets in a single operation, giving teams a unified, searchable view of cloud storage assets. You can locate specific files across buckets in seconds rather than manually browsing through the Google Cloud Console — making it easy to build comprehensive inventories for licensing, rights management, or data governance initiatives.

Cost Optimization
GCS offers multiple storage tiers with dramatically different price points, and understanding how your data is distributed across those tiers is the first step to reducing costs. By indexing storage class metadata alongside file age and size, you can quickly identify candidates for tier transitions — for example, old STANDARD-class data that should be moved to NEARLINE or COLDLINE. Track tier distribution over time to measure the impact of your optimization efforts.

Compliance and Governance
Organizations in regulated industries must maintain accurate records of data storage, demonstrate data handling practices, and respond to audit requests. The GCS scanner enables compliance workflows by indexing all objects with timestamps, MD5 hashes, and CRC32C checksums, allowing teams to verify data retention policies, detect unauthorized data placement, and accelerate audit response times from weeks to hours through automated searching and reporting.

Understanding Google Cloud Storage

Google Cloud Storage (GCS) is Google's highly durable, scalable object storage service. Like other object stores, GCS organizes data into buckets (top-level containers) that hold objects (your files). While GCS is technically a flat key-value store, it uses forward slashes (/) in object keys to simulate a folder hierarchy — and the Diskover GCS scanner translates this into the familiar directory structure you see in the Diskover Web UI.

Storage Classes

One of the most important pieces of metadata the scanner captures is the storage class, which determines both how quickly you can access an object and how much it costs to store. Here's a quick reference:

Storage Class	Access Speed	Use Case	Relative Cost
`STANDARD`	Milliseconds	Frequently accessed data	Highest
`NEARLINE`	Milliseconds	Data accessed less than once a month	Lower
`COLDLINE`	Milliseconds	Data accessed less than once a quarter	Lower still
`ARCHIVE`	Milliseconds	Long-term archive, accessed less than once a year	Lowest
`MULTI_REGIONAL`	Milliseconds	Legacy class — high availability across regions	Varies
`REGIONAL`	Milliseconds	Legacy class — data stored in a single region	Varies
`DURABLE_REDUCED_AVAILABILITY`	Milliseconds	Legacy class — reduced availability	Varies

Understanding these tiers matters because Diskover lets you search by storage class — so you can find expensive STANDARD-class objects that haven't been accessed in months and should be transitioned to a cheaper tier.

Note: MULTI_REGIONAL, REGIONAL, and DURABLE_REDUCED_AVAILABILITY are legacy storage classes. Google recommends using STANDARD, NEARLINE, COLDLINE, or ARCHIVE for new data. The scanner captures whichever class GCS reports for existing objects.

Checksums: MD5 Hashes and CRC32C

The scanner captures two types of checksums for each object:

Checksum	Available For	Use Case
MD5 Hash (`gcs_md5hash`)	Single-upload objects	Content deduplication, integrity verification, change detection
CRC32C (`gcs_crc32c`)	All objects (including composite)	Fast integrity validation

Important: Objects created via gsutil compose or the compose API (called composite objects) do not have MD5 hashes — this is a GCS platform limitation, not a scanner issue. These objects will have an empty gcs_md5hash field but will always have a CRC32C checksum.
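
To compare a local copy of a file against the indexed checksums, you can compute the same base64 values yourself. The following is a minimal sketch, not part of the scanner: the helper names `gcs_md5_b64` and `gcs_crc32c_b64` are hypothetical, and the CRC32C routine is a plain, unoptimized pure-Python implementation (a compiled library would be a better fit for large files). It assumes GCS reports the CRC32C as the base64 encoding of the 4-byte big-endian checksum.

```python
import base64
import hashlib
import struct


def crc32c(data: bytes) -> int:
    """Bitwise CRC32C (Castagnoli) using the reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def gcs_md5_b64(data: bytes) -> str:
    """Base64-encoded MD5 digest, the form stored in gcs_md5hash."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def gcs_crc32c_b64(data: bytes) -> str:
    """Base64 of the big-endian 32-bit CRC32C, the form stored in gcs_crc32c."""
    return base64.b64encode(struct.pack(">I", crc32c(data))).decode("ascii")


if __name__ == "__main__":
    sample = b"123456789"
    print(gcs_md5_b64(sample), gcs_crc32c_b64(sample))
```

If the values computed for a local file match the indexed `gcs_md5hash` or `gcs_crc32c`, the content is the same as the object in GCS.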

How the Scanner Maps GCS to Diskover

GCS's flat namespace gets converted into a hierarchical directory structure for Diskover:

GCS Location	Diskover Index Path
Bucket `mybucket`	`/mybucket`
Object `mybucket/folder/file.txt`	`/mybucket/folder/file.txt`
Object `mybucket/deep/nested/doc.pdf`	`/mybucket/deep/nested/doc.pdf`

This means you can browse GCS buckets in Diskover's file browser just like a normal filesystem, and all standard search queries work against the indexed data.
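
The mapping in the table above can be sketched in a few lines. This is an illustration of the idea only, not the scanner's actual implementation; the function names `diskover_path` and `implied_directories` are hypothetical. It shows how slash-delimited object names imply a set of directories even though GCS stores no directory entries.

```python
import posixpath
from typing import List


def diskover_path(bucket: str, object_name: str = "") -> str:
    """Map a bucket and an object name to a Diskover-style absolute path."""
    parts = [bucket, object_name.strip("/")]
    return "/" + "/".join(part for part in parts if part)


def implied_directories(bucket: str, object_names: List[str]) -> List[str]:
    """Return the sorted directory paths implied by slash-delimited object names."""
    dirs = {diskover_path(bucket)}
    for name in object_names:
        parent = posixpath.dirname(diskover_path(bucket, name))
        # Walk upward until the filesystem-style root is reached.
        while parent != "/":
            dirs.add(parent)
            parent = posixpath.dirname(parent)
    return sorted(dirs)


if __name__ == "__main__":
    names = ["folder/file.txt", "deep/nested/doc.pdf"]
    for directory in implied_directories("mybucket", names):
        print(directory)
```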

Requirements

System Requirements

Component	Requirement
Python	3.9 or higher
Diskover	Core installation with alternate scanner support
Network	HTTPS access to Google Cloud Storage endpoints (`storage.googleapis.com`)

Python Dependencies

Package	Version	Purpose
`google-cloud-storage`	2.x	Google Cloud Storage client library for bucket and blob operations
`google-auth`	2.x	Google authentication library (installed automatically with google-cloud-storage)
`google-api-core`	2.x	Google API core library for exception handling (installed automatically with google-cloud-storage)

Google Cloud IAM Requirements

The scanner requires read-only access to GCS buckets. Getting IAM permissions right is critical — this section walks through exactly what's needed.

Minimum Required IAM Roles

IAM Role	What It Grants	Why the Scanner Needs It
`roles/storage.objectViewer`	`storage.objects.get`, `storage.objects.list`	List and read objects within buckets
`roles/storage.legacyBucketReader`	`storage.buckets.get`, `storage.buckets.list`	List all buckets in a project (required for `gs://` all-bucket scanning)

Tip: If you only plan to scan specific named buckets (e.g., gs://mybucket) rather than auto-discovering all buckets with gs://, you can skip roles/storage.legacyBucketReader and grant only roles/storage.objectViewer on individual buckets.

Granting Project-Level Access (All Buckets)

To allow the scanner to discover and scan all buckets in a GCP project:

PROJECT_ID="your-project-id"
SA_EMAIL="diskover-scanner@${PROJECT_ID}.iam.gserviceaccount.com"

# Grant object read access across all buckets
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:$SA_EMAIL" \
    --role="roles/storage.objectViewer"

# Grant bucket listing access (required for gs:// all-bucket scanning)
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:$SA_EMAIL" \
    --role="roles/storage.legacyBucketReader"

Granting Bucket-Level Access (Specific Buckets Only)

For tighter security, restrict access to only the buckets the scanner needs:

PROJECT_ID="your-project-id"
SA_EMAIL="diskover-scanner@${PROJECT_ID}.iam.gserviceaccount.com"

# Grant access to a specific bucket
gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
    --member="serviceAccount:$SA_EMAIL" \
    --role="roles/storage.objectViewer"

Note: When granting bucket-level access, you must specify each bucket individually. The scanner will skip any buckets it cannot access and log a warning.

Verifying IAM Permissions

To confirm a service account has the correct roles:

# Check project-level roles
gcloud projects get-iam-policy $PROJECT_ID \
    --flatten="bindings[].members" \
    --filter="bindings.members:serviceAccount:diskover-scanner@*" \
    --format="table(bindings.role)"

# Check bucket-level roles
gcloud storage buckets get-iam-policy gs://BUCKET_NAME

Custom IAM Policy (Least Privilege)

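If your organization requires a custom IAM role instead of predefined roles, these are the exact permissions the scanner needs. They are shown here as a role definition file (the title and description are placeholders) that can be passed to gcloud iam roles create with the --file flag:

{
    "title": "Diskover GCS Scanner",
    "description": "Read-only access for the Diskover GCS scanner",
    "stage": "GA",
    "includedPermissions": [
        "storage.objects.get",
        "storage.objects.list",
        "storage.buckets.get",
        "storage.buckets.list"
    ]
}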

Important: If you're using Uniform Bucket-Level Access on your buckets (recommended by Google), ACLs are disabled and all access is controlled through IAM. If you're using Fine-Grained access control, ensure the service account is also granted read access through ACLs. When in doubt, use Uniform Bucket-Level Access — it's simpler to manage and audit.

Installation

Step 1: Install the Scanner Package

Linux:

dnf install diskover-scanner-gcp

Windows:

The scanner files are included with the Diskover Windows installation. No separate installation step is required.

Install locations:

Linux: /opt/diskover/scanners/scandir_gcp/
Windows: C:\Program Files\Diskover\scanners\scandir_gcp\

Step 2: Install Python Dependencies

Install the Google Cloud Storage client library:

python3 -m pip install google-cloud-storage

Step 3: Verify Installation

Confirm google-cloud-storage is installed and accessible:

python3 -c "from google.cloud import storage; print('google-cloud-storage: OK')"

You should see google-cloud-storage: OK printed without errors.

Check the installed version:

python3 -m pip show google-cloud-storage | grep -E "^(Name|Version)"

Step 4: Configure GCP Credentials

The GCS scanner supports multiple authentication methods. Choose the one that best fits your environment.

Option A: Service Account JSON Key File (recommended for most deployments)

A service account JSON key file provides explicit, portable credentials that work in any environment. This is the recommended method for Diskover deployments.

Create a service account (skip if you already have one):

gcloud iam service-accounts create diskover-scanner \
    --display-name="Diskover GCS Scanner" \
    --description="Service account for Diskover GCS scanning"

Grant the required IAM roles (see IAM Requirements above):

PROJECT_ID=$(gcloud config get-value project)
SA_EMAIL="diskover-scanner@${PROJECT_ID}.iam.gserviceaccount.com"

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:$SA_EMAIL" \
    --role="roles/storage.objectViewer"

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:$SA_EMAIL" \
    --role="roles/storage.legacyBucketReader"

Generate and download the JSON key file:

gcloud iam service-accounts keys create /opt/diskover/gcs-credentials.json \
    --iam-account=$SA_EMAIL

Secure the key file:

chmod 600 /opt/diskover/gcs-credentials.json
chown diskover:diskover /opt/diskover/gcs-credentials.json

Set the environment variable for the scanner:

export GCS_CREDENTIALS_FILE="/opt/diskover/gcs-credentials.json"

On Windows:

$env:GCS_CREDENTIALS_FILE = "C:\Program Files\Diskover\gcs-credentials.json"

Option B: Application Default Credentials (ADC)

ADC uses the standard GOOGLE_APPLICATION_CREDENTIALS environment variable or gcloud CLI credentials. This method is useful for development environments.

Using a service account key file via ADC:

export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"

On Windows:

$env:GOOGLE_APPLICATION_CREDENTIALS = "C:\path\to\service-account-key.json"

Using gcloud CLI credentials (development only):

gcloud auth application-default login

Option C: GCE/GKE Metadata Server (when running on Google Cloud infrastructure)

When running on Compute Engine, GKE, or Cloud Run, the scanner automatically uses the attached service account via the metadata server. No explicit credential configuration is needed — just ensure the VM or pod service account has the required IAM roles.

Verify the metadata server is available and a service account is attached:

curl -s -H "Metadata-Flavor: Google" \
    http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email

Step 5: Verify GCS Connectivity

Test that the scanner can reach your GCS buckets:

python3 -c "
from google.cloud import storage
import os
creds_file = os.getenv('GCS_CREDENTIALS_FILE', '')
if creds_file:
    client = storage.Client.from_service_account_json(creds_file)
else:
    client = storage.Client()
buckets = list(client.list_buckets())
print(f'Found {len(buckets)} buckets:')
for bucket in buckets[:5]:
    print(f'  - {bucket.name}')
"

You should see the number of accessible buckets, followed by the names of up to five of them.

Configuration

Configuration is managed through the Diskover Admin UI under Diskover > Alternate Scanners > GCS.

Configuration Parameters

Parameter	Type	Default	Description
`credentials_file`	string	(empty)	Path to a Google Cloud service account JSON key file. If empty, falls through to Application Default Credentials. Can be overridden by the `GCS_CREDENTIALS_FILE` environment variable.
`project_id`	string	(empty)	GCP project ID. Optional — inferred from credentials if not set. Can be overridden by the `GCS_PROJECT_ID` environment variable.
`page_size`	int	`1000`	Maximum number of results per API page when listing blobs. Increase for large buckets to reduce API call count.

YAML Configuration Example

# Diskover Admin Configuration
Diskover:
  Alternate Scanners:
    GCS:
      Default:
        credentials_file: /opt/diskover/gcs-credentials.json
        project_id: my-gcp-project-id
        page_size: 1000

Task-Level Environment Variables

These variables are set per scan task (not in global configuration) since different tasks may target different buckets or projects:

Environment Variable	Description
`GCS_CREDENTIALS_FILE`	Path to a Google Cloud service account JSON key file. Overrides the `credentials_file` configuration parameter.
`GCS_PROJECT_ID`	GCP project ID. Overrides the `project_id` configuration parameter.
`GCS_BUCKET`	Default bucket name. Can be overridden by the command-line path argument.
`GOOGLE_APPLICATION_CREDENTIALS`	Standard GCP environment variable for Application Default Credentials. Used as fallback if `GCS_CREDENTIALS_FILE` is not set.

Authentication Resolution Order

The scanner resolves credentials in the following priority order:

Priority	Source	Description
1	`GCS_CREDENTIALS_FILE` env var	Task-level environment variable pointing to a service account JSON key file
2	`credentials_file` config	Configuration parameter set via Admin UI or YAML
3	`GOOGLE_APPLICATION_CREDENTIALS` env var	Standard GCP ADC environment variable
4	gcloud CLI credentials	Credentials from `gcloud auth application-default login`
5	GCE/GKE metadata server	Automatic credentials on Google Cloud infrastructure
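
The priority order above can be read as a simple fall-through. The sketch below is an assumption about how such a resolution could look, not the scanner's source code; `resolve_credentials` is a hypothetical name. Priorities 3 to 5 are not handled explicitly because Application Default Credentials in the google-auth library already checks `GOOGLE_APPLICATION_CREDENTIALS`, then gcloud CLI credentials, then the metadata server.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_credentials(
    config_credentials_file: str = "",
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, Optional[str]]:
    """Return (source, key_file_path); the path is None when ADC should be used."""
    env = os.environ if env is None else env

    # Priority 1: task-level environment variable.
    if env.get("GCS_CREDENTIALS_FILE"):
        return ("GCS_CREDENTIALS_FILE env var", env["GCS_CREDENTIALS_FILE"])

    # Priority 2: credentials_file from the Admin UI or YAML configuration.
    if config_credentials_file:
        return ("credentials_file config", config_credentials_file)

    # Priorities 3-5: defer to Application Default Credentials.
    return ("application default credentials", None)


if __name__ == "__main__":
    print(resolve_credentials("/opt/diskover/gcs-credentials.json"))
```

A returned path would typically be passed to `storage.Client.from_service_account_json`, while `None` means calling `storage.Client()` and letting ADC decide.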

Note: Credentials and project ID are typically configured via environment variables set per scan task rather than in the Admin UI, since different tasks may scan different projects or use different credentials.

Configuration via Diskover Admin

Navigate to Diskover Admin > Diskover > Alternate Scanners > GCS
Set the credentials file path, or leave empty for ADC
Optionally set the GCP project ID
Adjust page_size for API pagination tuning
Save the configuration

Sample Configuration in Diskover Admin:

Here is the beginning of our sample configuration. There are many other configurations for the GCS Scanner, covered in detail below.

Configuration Examples

Standard Configuration — Service Account Key File

This is the recommended configuration for most deployments:

Diskover:
  Alternate Scanners:
    GCS:
      Default:
        credentials_file: /opt/diskover/gcs-credentials.json
        project_id: my-gcp-project-id
        page_size: 1000

High-Throughput Configuration for Large Buckets

Increase the page size for scanning buckets with millions of objects:

Diskover:
  Alternate Scanners:
    GCS:
      Default:
        credentials_file: /opt/diskover/gcs-credentials.json
        project_id: my-gcp-project-id
        page_size: 5000

Usage

The GCS scanner uses the standard --altscanner flag with diskover.py.

Basic Commands

Scan a specific bucket:

# Linux
cd /opt/diskover
python3 diskover.py --altscanner scandir_gcp gs://mybucket

# Windows
cd "C:\Program Files\Diskover"
python diskover.py --altscanner scandir_gcp gs://mybucket

Scan all accessible buckets in the GCP project:

python3 diskover.py --altscanner scandir_gcp gs://

This automatically discovers all buckets and creates separate top paths for each one (e.g., /mybucket1, /mybucket2).

Scan a specific prefix (folder) within a bucket:

python3 diskover.py --altscanner scandir_gcp gs://mybucket/data/2024

Scan with a custom Elasticsearch index name:

python3 diskover.py -i diskover-gcs-data --altscanner scandir_gcp gs://mybucket

Scan multiple specific buckets into one index:

python3 diskover.py -i diskover-gcs-all --altscanner scandir_gcp gs://bucket1 gs://bucket2 gs://bucket3

Enable debug logging for troubleshooting:

python3 diskover.py --altscanner scandir_gcp --loglevel DEBUG gs://mybucket

Scan with specific credentials:

export GCS_CREDENTIALS_FILE="/opt/diskover/gcs-credentials.json"
python3 diskover.py --altscanner scandir_gcp gs://mybucket

Path Format Reference

Path Format	Description	Example
`gs://`	Scan all accessible buckets in the GCP project	`gs://`
`gs://<bucket>`	Scan an entire bucket	`gs://mybucket`
`gs://<bucket>/<prefix>`	Scan a specific object prefix	`gs://mybucket/data/2024`
`gs://<bucket>/<prefix>/`	Scan prefix (trailing slash is optional)	`gs://mybucket/archives/`
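
The four path formats reduce to a bucket name and an optional prefix. A small parser makes the rules concrete; this is a sketch under the assumptions of the table above, and `parse_gs_path` is a hypothetical helper, not a function exposed by the scanner.

```python
from typing import Optional, Tuple


def parse_gs_path(path: str) -> Tuple[Optional[str], str]:
    """Split a gs:// path into (bucket, prefix).

    A bucket of None means "all accessible buckets"; the prefix is returned
    without leading or trailing slashes, so the trailing slash is optional.
    """
    scheme = "gs://"
    if not path.startswith(scheme):
        raise ValueError(f"not a gs:// path: {path!r}")
    bucket, _, prefix = path[len(scheme):].partition("/")
    return (bucket or None, prefix.strip("/"))


if __name__ == "__main__":
    for example in ("gs://", "gs://mybucket", "gs://mybucket/data/2024", "gs://mybucket/archives/"):
        print(example, "->", parse_gs_path(example))
```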

Integration with Index Tasks

When configuring the GCS scanner as part of a scheduled Index Task:

Field	Value
Alternate Scanner	`scandir_gcp`

Set the scan path to your desired gs:// path in the task configuration. Set any task-level environment variables (GCS_CREDENTIALS_FILE, GCS_PROJECT_ID) in the task environment.

Performance Tips

The page_size parameter controls the number of objects returned per API page when listing blobs. Adjust this value based on your bucket size and network conditions:

Scenario	Recommended Page Size	Notes
Small buckets (<100K objects)	1000 (default)	Default is sufficient
Medium buckets (100K–1M objects)	2000	Reduces API call count
Large buckets (>1M objects)	5000	Balance with memory and API quotas

For best performance when scanning from Google Cloud infrastructure, deploy Diskover in the same region as your GCS buckets and use Private Google Access to avoid internet routing overhead. The scanner automatically sizes its HTTP connection pool to match Diskover's thread count, so no manual connection tuning is needed.
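
The effect of page_size on API call count is simple arithmetic. The sketch below estimates the number of list requests for one bucket, assuming the service returns full pages of exactly the requested size (it may return fewer results per page, so treat the result as a lower bound). `estimated_list_calls` is a hypothetical helper used only for illustration.

```python
def estimated_list_calls(object_count: int, page_size: int = 1000) -> int:
    """Lower-bound estimate of list requests needed to enumerate a bucket."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # Integer ceiling division: one request per full or partial page.
    return -(-object_count // page_size)


if __name__ == "__main__":
    for size in (1000, 2000, 5000):
        print(size, estimated_list_calls(1_000_000, size))
```

For a bucket with 1,000,000 objects, this gives 1,000 requests at a page size of 1000 and 200 requests at a page size of 5000.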

Sample CLI Scan:

Sample Index Task Scan:

Metadata Fields

The GCS scanner adds three custom metadata fields to every indexed object, alongside the standard file metadata (name, path, size, timestamps, etc.).

Field Mappings

Field Path	ES Type	Description
`gcs_md5hash`	`keyword`	Base64-encoded MD5 hash of the object content. Empty for composite objects (created via compose operations). Useful for deduplication and integrity verification.
`gcs_crc32c`	`keyword`	Base64-encoded CRC32C checksum of the object content. Available for all objects including composite objects. Useful for fast integrity validation.
`gcs_storageclass`	`keyword`	GCS storage class: `STANDARD`, `NEARLINE`, `COLDLINE`, `ARCHIVE`, `MULTI_REGIONAL`, `REGIONAL`, or `DURABLE_REDUCED_AVAILABILITY`.

Elasticsearch Mapping Definition

{
  "mappings": {
    "properties": {
      "gcs_md5hash": {
        "type": "keyword"
      },
      "gcs_crc32c": {
        "type": "keyword"
      },
      "gcs_storageclass": {
        "type": "keyword"
      }
    }
  }
}

Example Indexed Document

Here's what a typical GCS object looks like once indexed:

{
  "name": "quarterly_report_2024_q1.xlsx",
  "path": "/mybucket/reports/finance/quarterly_report_2024_q1.xlsx",
  "path_parent": "/mybucket/reports/finance",
  "extension": "xlsx",
  "size": 1048576,
  "size_du": 1048576,
  "mtime": "2024-04-15T10:30:00Z",
  "atime": "2024-04-15T10:30:00Z",
  "ctime": "2024-04-15T10:30:00Z",
  "type": "file",
  "gcs_md5hash": "2LMiq0clNJ6/QEG+r80EMA==",
  "gcs_crc32c": "ss6JuQ==",
  "gcs_storageclass": "STANDARD"
}

Searching in Diskover

Once GCS data is indexed, you can use Diskover's standard search syntax to query GCS-specific metadata fields.

Search Query Examples

By Storage Class

Query	Description
`gcs_storageclass:STANDARD`	Find all objects in the STANDARD storage class
`gcs_storageclass:NEARLINE`	Find all objects in NEARLINE
`gcs_storageclass:(COLDLINE OR ARCHIVE)`	Find all objects in the COLDLINE or ARCHIVE storage classes
`gcs_storageclass:(MULTI_REGIONAL OR REGIONAL)`	Find objects in legacy storage classes

Cost Optimization Queries

Query	Description
`gcs_storageclass:STANDARD AND mtime:<now-90d`	STANDARD objects older than 90 days — candidates for NEARLINE
`gcs_storageclass:STANDARD AND size:>1073741824`	Large STANDARD objects over 1 GB — high-cost candidates
`gcs_storageclass:(STANDARD OR NEARLINE) AND mtime:<now-365d`	Objects older than 1 year still in expensive tiers
`gcs_storageclass:STANDARD AND extension:(mp4 OR mov OR avi)`	Video files in STANDARD — consider COLDLINE or ARCHIVE
`mtime:<now-2555d`	Objects older than 7 years (retention policy review)
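
The same logic as the first query above (`gcs_storageclass:STANDARD AND mtime:<now-90d`) can be applied to documents that have already been exported from the index, for example to total the bytes that are candidates for a cheaper tier. This is a sketch that assumes each document is a dict shaped like the example indexed document, with `size`, `mtime`, and `gcs_storageclass` keys; `tier_candidates` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone


def tier_candidates(docs, min_age_days=90, storage_class="STANDARD", now=None):
    """Return (matching_docs, total_bytes) for objects older than min_age_days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=min_age_days)
    matches = []
    for doc in docs:
        # Normalize the trailing "Z" so fromisoformat accepts it on Python 3.9.
        mtime = datetime.fromisoformat(doc["mtime"].replace("Z", "+00:00"))
        if doc.get("gcs_storageclass") == storage_class and mtime < cutoff:
            matches.append(doc)
    return matches, sum(doc["size"] for doc in matches)


if __name__ == "__main__":
    sample = [
        {"name": "old.log", "size": 100, "mtime": "2024-01-01T00:00:00Z", "gcs_storageclass": "STANDARD"},
        {"name": "new.log", "size": 200, "mtime": "2024-06-01T00:00:00Z", "gcs_storageclass": "STANDARD"},
    ]
    fixed_now = datetime(2024, 6, 15, tzinfo=timezone.utc)
    print(tier_candidates(sample, now=fixed_now))
```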

Checksum Queries

Query	Description
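`gcs_md5hash:"2LMiq0clNJ6/QEG+r80EMA=="`	Find objects with a specific MD5 hash (exact match for deduplication). Quote the value, because base64 strings can contain characters such as `/` and `+` that are reserved in query syntax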
`gcs_md5hash:""`	Find composite objects missing MD5 hash
`gcs_crc32c:*`	Find all objects with CRC32C checksums (should be all objects)
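
Tools such as md5sum print MD5 digests in hexadecimal, while `gcs_md5hash` stores them base64-encoded, so a conversion is needed before searching. The helpers below are a hypothetical sketch (`md5_hex_to_gcs_b64` and `md5_query` are not part of Diskover). Wrapping the value in double quotes is a precaution so that characters like `/` and `+` are not interpreted as query operators.

```python
import base64
import binascii


def md5_hex_to_gcs_b64(hex_digest: str) -> str:
    """Convert a 32-character hex MD5 digest to the base64 form GCS reports."""
    raw = binascii.unhexlify(hex_digest.strip())
    if len(raw) != 16:
        raise ValueError("an MD5 digest must be 16 bytes (32 hex characters)")
    return base64.b64encode(raw).decode("ascii")


def md5_query(hex_digest: str) -> str:
    """Build a quoted search query for the gcs_md5hash field."""
    return f'gcs_md5hash:"{md5_hex_to_gcs_b64(hex_digest)}"'


if __name__ == "__main__":
    # MD5 of empty content, as printed by md5sum.
    print(md5_query("d41d8cd98f00b204e9800998ecf8427e"))
```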

Combined Queries

Query	Description
`path_parent:logs AND gcs_storageclass:STANDARD AND mtime:<now-30d`	Old log files still in STANDARD — good cleanup candidates
`(path_parent:public OR path_parent:shared) AND name:confidential`	Potentially sensitive files in general-access locations

Sample Metadata from GCS Scanner Output:

Troubleshooting

Common Issues

Issue	Cause	Solution
"Could not automatically determine credentials"	No GCP credentials configured	Set `GCS_CREDENTIALS_FILE`, `GOOGLE_APPLICATION_CREDENTIALS`, or configure ADC
"403 Forbidden" or "Caller does not have storage.objects.list"	Missing IAM permissions	Grant `roles/storage.objectViewer` and `roles/storage.legacyBucketReader` (see Requirements section)
Bucket not found / skipped during scan	Bucket name incorrect or wrong project	Verify bucket name (case-sensitive, globally unique) and set the correct `GCS_PROJECT_ID`
"No module named 'google.cloud'"	google-cloud-storage not installed	Run `python3 -m pip install google-cloud-storage`
"ValueError: Service account info was not in the expected format"	Corrupted or invalid JSON key file	Regenerate the key file via `gcloud iam service-accounts keys create`
No buckets discovered with `gs://`	Missing `storage.buckets.list` permission	Grant `roles/storage.legacyBucketReader`, or scan specific buckets by name instead
Wrong project buckets listed	Incorrect project configured	Set `GCS_PROJECT_ID` explicitly or verify the project in the credentials file
Empty `gcs_md5hash` for some objects	Composite objects don't have MD5 hashes in GCS	Expected behavior — use `gcs_crc32c` for integrity validation of composite objects
Slow scan performance	Low `page_size` or cross-region network latency	Increase `page_size`, deploy in the same region as buckets, enable Private Google Access
Connection timeout or DNS resolution errors	Network connectivity issues to `storage.googleapis.com`	Verify firewall allows outbound HTTPS (port 443) to `*.googleapis.com`
"Connection pool is full" warnings	Many concurrent threads	The scanner automatically sizes the connection pool to match Diskover's thread count — this warning should not occur under normal conditions

Debug Logging

Enable detailed logging to diagnose issues:

# Linux
python3 /opt/diskover/diskover.py --altscanner scandir_gcp --loglevel DEBUG gs://mybucket

# Windows
cd "C:\Program Files\Diskover"
python diskover.py --altscanner scandir_gcp --loglevel DEBUG gs://mybucket

Log File Locations

Linux: /var/log/diskover/diskover.log
Windows: Check the Diskover service logs or your configured log location

Diagnosing Authentication Issues

Step 1 — Check credentials configuration:

echo "GCS_CREDENTIALS_FILE: $GCS_CREDENTIALS_FILE"
echo "GOOGLE_APPLICATION_CREDENTIALS: $GOOGLE_APPLICATION_CREDENTIALS"

On Windows:

Write-Output "GCS_CREDENTIALS_FILE: $env:GCS_CREDENTIALS_FILE"
Write-Output "GOOGLE_APPLICATION_CREDENTIALS: $env:GOOGLE_APPLICATION_CREDENTIALS"

Step 2 — Verify the credentials file is valid JSON:

python3 -c "
import json, os, sys
f = os.getenv('GCS_CREDENTIALS_FILE', '')
# Stop early with a clear message instead of failing on open('')
if not f:
    sys.exit('GCS_CREDENTIALS_FILE is not set')
with open(f) as fh:
    data = json.load(fh)
required = ['type', 'project_id', 'private_key_id', 'private_key', 'client_email']
for field in required:
    print(f'{field}: {\"present\" if field in data else \"MISSING\"}')
print(f'Type: {data.get(\"type\", \"unknown\")}')
"

Step 3 — Test authentication end-to-end:

python3 -c "
from google.cloud import storage
import os
creds_file = os.getenv('GCS_CREDENTIALS_FILE', '')
try:
    if creds_file:
        client = storage.Client.from_service_account_json(creds_file)
    else:
        client = storage.Client()
    buckets = list(client.list_buckets())
    print(f'Authentication successful - found {len(buckets)} buckets')
except Exception as e:
    print(f'Authentication failed: {e}')
"

Diagnosing Network Connectivity

Test HTTPS connectivity to Google Cloud Storage:

curl -I "https://storage.googleapis.com/"

Check DNS resolution:

nslookup storage.googleapis.com

If using a proxy, ensure it's configured:

export HTTPS_PROXY="http://proxy.example.com:8080"

Support

Last Updated: April 2026
Diskover Data, Inc.
