Google Workspace - Google Drive
License: PRO+ (Professional Edition or higher)
Module Type: Alternate Scanner
Author: Diskover Data, Inc.
Overview / Use Cases
The Google Drive Alternate Scanner (scandir_google) gives Diskover visibility into your organization's Google Drive content by indexing files and folders through the Google Drive REST API (v3). It translates Google Drive's parent-child folder hierarchy into a POSIX-style path structure that Diskover can traverse, index, and search — turning cloud-resident data into searchable, analyzable metadata alongside your on-premises file system indexes.
The scanner handles personal drives ("My Drive"), shared drives (Team Drives), and direct folder-ID targeting via the gdrive-id:// URI scheme. Google Apps native files (Docs, Sheets, Slides, Forms, Drawings, Sites) are indexed with estimated sizes and virtual extensions (.gdoc, .gsheet, .gslides, etc.) so they appear alongside regular files in search results.
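The translation from parent-child links to POSIX-style paths can be pictured with a minimal sketch. It is not the scanner's actual code: the `FOLDERS` dictionary and the `posix_path` helper are hypothetical, and a real scan would build this data from Drive API listings.

```python
from typing import Dict, Optional, Tuple

# Hypothetical in-memory view of Drive metadata: id -> (name, parent id or None).
FOLDERS: Dict[str, Tuple[str, Optional[str]]] = {
    "root": ("My Drive", None),
    "f1": ("Projects", "root"),
    "f2": ("2025", "f1"),
}


def posix_path(item_id: str, items: Dict[str, Tuple[str, Optional[str]]],
               top: str = "/gdrive") -> str:
    """Walk parent links up to the drive root and join the names into a path."""
    parts = []
    current: Optional[str] = item_id
    while current is not None:
        name, parent = items[current]
        if parent is None:
            break  # the drive root maps to the top path itself
        parts.append(name)
        current = parent
    return "/".join([top] + parts[::-1])


print(posix_path("f2", FOLDERS))  # /gdrive/Projects/2025
```

Each item stores only its parent ID, so the path is assembled by walking upward until the drive root is reached.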
Key Use Cases
Cloud Storage Visibility (Storage Administrators & IT)
Files scattered across hundreds of user drives and dozens of shared drives are invisible at the filesystem level. The scanner indexes every accessible drive into Diskover, giving administrators a single search interface to locate files across the entire Google Workspace tenant without hunting through the Google Drive web UI.
Shared Drive Governance (Data Managers)
Shared drives (Team Drives) often accumulate years of content without clear ownership or lifecycle management. With include_shared_drives enabled, the scanner automatically discovers every shared drive and indexes its contents, making it easy to identify stale drives, oversized folders, and files that should be archived or deleted based on age, MIME type, or access patterns.
Cost & Capacity Planning (Storage Engineers)
The scanner queries Google Drive's storageQuota API to surface total and used storage, enabling unified capacity reporting across on-premises NAS and cloud-hosted Google Drive. Organizations can forecast storage growth, compare cost per TB across platforms, and make data-driven decisions about tiering or archival policies.
Understanding Google Drive & Google Cloud
Before configuring the scanner, it helps to understand a few Google-specific concepts that shape how the scanner is deployed and what it can see.
Drive Types
Drive Type | Description | How the Scanner Sees It |
|---|---|---|
My Drive | A user's personal drive | Indexed under /gdrive |
Shared Drive (Team Drive) | Organization-owned drive with shared membership | Auto-discovered and indexed under /gdrive/<drive_name> |
Shared With Me | Files others have shared with the authenticated account | Not indexed as a top-level path; accessible only via the owning drive or folder ID |
Authentication Methods
Method | Best For | Setup Effort | User Interaction |
|---|---|---|---|
Service Account (recommended) | Production / headless / server deployments | Moderate (share drives with service account email) | None after setup |
OAuth 2.0 Desktop | Workstation installs, testing, single-user scans | Low | Browser-based authorization on first run |
Recommendation: For production Diskover deployments running on servers, use a service account. Service accounts don't require interactive browser auth, their access tokens are refreshed automatically without user involvement, and they're easier to audit. OAuth 2.0 is suitable for development or single-user workstation use.
Google Apps File Types
Google Apps files (Docs, Sheets, Slides, etc.) are not traditional files — they live inside Google's infrastructure and have no native byte size. The scanner handles them with estimated sizes and synthetic extensions:
MIME Type | Virtual Extension | Estimated Size |
|---|---|---|
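application/vnd.google-apps.document | .gdoc | 50 KB |
application/vnd.google-apps.spreadsheet | .gsheet | 100 KB |
application/vnd.google-apps.presentation | .gslides | 200 KB |
| | 10 KB |
| | 50 KB |
| | 10 KB |
Other Google Apps types | | 25 KB |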
These can be excluded entirely by setting include_google_apps_files: false if you only care about traditional uploaded files (PDFs, images, Office docs, etc.).
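As a rough illustration of how estimated sizes and virtual extensions might be assigned, here is a minimal sketch. It is not the scanner's implementation: `ESTIMATES`, `FALLBACK_SIZE`, and `virtual_entry` are hypothetical names. Only the Docs values are confirmed by the example document later in this article; the Sheets and Slides values assume the table follows the order in which the overview lists the file types.

```python
from typing import Optional, Tuple

KB = 1024
GOOGLE_APPS_PREFIX = "application/vnd.google-apps."
FOLDER_MIME = GOOGLE_APPS_PREFIX + "folder"

# MIME type -> (virtual extension, estimated size in bytes).
# Docs is confirmed by the indexed-document example; Sheets and Slides are assumed.
ESTIMATES = {
    GOOGLE_APPS_PREFIX + "document": (".gdoc", 50 * KB),
    GOOGLE_APPS_PREFIX + "spreadsheet": (".gsheet", 100 * KB),
    GOOGLE_APPS_PREFIX + "presentation": (".gslides", 200 * KB),
}
FALLBACK_SIZE = 25 * KB  # "Other Google Apps types" row


def virtual_entry(mime_type: str) -> Optional[Tuple[str, int]]:
    """Return (extension, estimated size) for a Google Apps file, else None."""
    if not mime_type.startswith(GOOGLE_APPS_PREFIX) or mime_type == FOLDER_MIME:
        return None  # uploaded files report a real size; folders are not files
    # The extension for unlisted types is left empty in this sketch.
    return ESTIMATES.get(mime_type, ("", FALLBACK_SIZE))


print(virtual_entry("application/vnd.google-apps.document"))  # ('.gdoc', 51200)
print(virtual_entry("application/pdf"))                       # None
```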
Requirements
System Requirements
Component | Requirement |
|---|---|
Python | 3.9 or higher |
Diskover | Core installation with alternate scanner support |
Google Cloud | Project with Google Drive API enabled |
Network | HTTPS (port 443) access to |
Python Dependencies
Package | Version | Purpose |
|---|---|---|
google-api-python-client | 2.110.0+ | Google Drive API v3 client |
google-auth | 2.25.2+ | Google authentication library |
google-auth-oauthlib | 1.1.0+ | OAuth 2.0 flow support for interactive auth |
google-auth-httplib2 | 0.2.0+ | httplib2 transport adapter for google-auth |
python-dateutil | 2.8.0+ | ISO 8601 timestamp parsing |
Google Cloud / Google Drive API Requirements
Requirement | Description |
|---|---|
Google Cloud Project | A project in Google Cloud Console with the Google Drive API enabled |
Authentication Credentials | Either a service account JSON key (recommended) or an OAuth 2.0 Desktop Client ID JSON |
API Scope | Read-only access to Google Drive |
Drive Access | For service accounts: each drive/folder must be shared with the service account email. For OAuth: the authenticating user must have access to the target drives |
License Tier | Diskover PRO+ (Professional Edition or higher) |
Installation
Step 1: Install the Scanner Package
Linux:
dnf install diskover-scanner-google
Windows:
The scanner files are included with the Diskover Windows installation — no separate install step is needed.
Install locations:
Linux:
/opt/diskover/scanners/scandir_google/
Windows:
C:\Program Files\Diskover\scanners\scandir_google\
Step 2: Install Python Dependencies
pip3 install -r /opt/diskover/scanners/scandir_google/etc/requirements.txt
Step 3: Verify Scanner Module Installation
Confirm the scanner module loads and dependencies resolve correctly:
cd /opt/diskover
python3 -c "import sys; sys.path.insert(0, 'scanners/scandir_google'); import scandir_google; print('Scanner module loaded OK, version:', scandir_google.__version__)"
Step 4: Configure Google Drive API Credentials
Choose one of the two authentication methods below.
Option A: Service Account (Recommended for Production)
Service accounts authenticate without browser interaction and are ideal for Diskover servers.
Go to Google Cloud Console and create (or select) a project.
Enable the Google Drive API: APIs & Services > Library > search "Google Drive API" > Enable.
Create a service account: APIs & Services > Credentials > Create Credentials > Service Account.
After creating the service account, open it and go to the Keys tab > Add Key > Create New Key > JSON. Download the key file.
Copy the key file to the scanner directory:
cp ~/Downloads/<project>-<hash>.json /opt/diskover/scanners/scandir_google/service-account.json
Share the target drives/folders with the service account email (found in the service account details, e.g., diskover-scanner@my-project.iam.gserviceaccount.com). The service account can only see drives explicitly shared with it.
Set service_account_file: service-account.json in the Diskover Admin configuration (see Configuration).
Important: Service accounts cannot "see" drives by default — each drive, shared drive, or folder you want scanned must be explicitly shared with the service account's email address (Viewer permission is sufficient).
Option B: OAuth 2.0 (Interactive, Desktop Installs)
Use OAuth for ad-hoc scans from a workstation or for single-user deployments.
In Google Cloud Console, go to APIs & Services > Credentials.
Click Create Credentials > OAuth 2.0 Client ID > Desktop application.
Download the JSON file and save it as credentials.json in the scanner directory:
cp ~/Downloads/client_secret_*.json /opt/diskover/scanners/scandir_google/credentials.json
On first run, a browser window opens asking you to grant read-only access to Google Drive. After approval, a token.json file is written for subsequent runs.
If running on a headless system, the scanner falls back to a manual URL-paste flow: visit the printed URL on another machine, approve, and paste the resulting code back into the terminal.
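The choice between the two options can be summarized in a short sketch. It assumes that a non-empty service_account_file takes precedence over the OAuth files, which is suggested by the configuration examples in this article but not stated outright. `choose_auth_method` is a hypothetical helper, not part of the scanner.

```python
def choose_auth_method(config: dict) -> str:
    """Pick an auth method from the documented configuration parameters.

    Assumption: a non-empty service_account_file wins; otherwise OAuth is used
    with credentials_file and token_file.
    """
    if config.get("service_account_file", "").strip():
        return "service_account"
    return "oauth"


server_config = {"service_account_file": "service-account.json", "credentials_file": ""}
workstation_config = {"service_account_file": "", "credentials_file": "credentials.json"}

print(choose_auth_method(server_config))       # service_account
print(choose_auth_method(workstation_config))  # oauth
```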
Step 5: Verify Google Drive Connectivity
Test that authentication works and the Drive API is reachable:
cd /opt/diskover
python3 -c "
import sys
sys.path.insert(0, 'scanners/scandir_google')
from google_scanner.drive_client.auth import GoogleDriveAuth
auth = GoogleDriveAuth(
    'scanners/scandir_google/credentials.json',
    'scanners/scandir_google/token.json',
    service_account_file='scanners/scandir_google/service-account.json'  # remove if using OAuth only
)
if auth.authenticate():
    svc = auth.get_service()
    about = svc.about().get(fields='user,storageQuota').execute()
    print('Authenticated as:', about['user']['emailAddress'])
    quota = about.get('storageQuota', {})
    print('Storage usage:', int(quota.get('usage', 0)) // (1024**3), 'GB')
else:
    print('Authentication failed')
"
Configuration
Configuration is managed through the Diskover Admin UI under Settings > Alternate Scanners > Google Drive.
Configuration Parameters
Parameter | Type | Default | Description |
|---|---|---|---|
credentials_file | string | | Path to the OAuth 2.0 credentials file (relative to the scanner directory, or absolute) |
token_file | string | | Path to the OAuth 2.0 token cache file (relative to the scanner directory, or absolute) |
service_account_file | string | | Path to the service account JSON key file for headless auth (empty string disables service account auth) |
include_shared_drives | bool | | Auto-discover and index all accessible shared drives (Team Drives) as additional top paths |
include_google_apps_files | bool | | Include Google Apps native files (Docs, Sheets, Slides, etc.) with estimated sizes and virtual extensions |
request_timeout | int | | Timeout in seconds for each Google Drive API request (per-thread) |
request_limit_per_minute | int | | Proactive API rate limit (requests/minute). Google's default quota is 12,000 requests per minute; leave headroom to avoid 429 errors |
prefetch_workers | int | | Number of background worker threads that pre-fetch directory listings in parallel with the crawl |
Advanced tuning: request_limit_per_minute and prefetch_workers rarely need adjustment. Increase prefetch_workers cautiously on large drives; decrease request_limit_per_minute if you share the Google Cloud project quota with other applications.
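To make the idea of a proactive per-minute limit concrete, here is a minimal sliding-window sketch. It is an illustration only, not the scanner's own limiter: the class name `MinuteRateLimiter` and its injectable `clock` and `sleep` parameters are hypothetical.

```python
import threading
import time
from collections import deque


class MinuteRateLimiter:
    """Sliding-window limiter: at most `limit` calls in any 60-second window."""

    def __init__(self, limit, clock=time.monotonic, sleep=time.sleep):
        self.limit = limit
        self.clock = clock
        self.sleep = sleep
        self.calls = deque()          # timestamps of recent calls, oldest first
        self.lock = threading.Lock()  # crawl threads share one limiter

    def acquire(self):
        """Block until one more request fits in the window; return seconds waited."""
        with self.lock:
            now = self.clock()
            # Drop timestamps that have left the 60-second window.
            while self.calls and now - self.calls[0] >= 60:
                self.calls.popleft()
            waited = 0.0
            if len(self.calls) >= self.limit:
                # Wait until the oldest call in the window expires.
                waited = 60 - (now - self.calls[0])
                self.sleep(waited)
                now = self.clock()
                self.calls.popleft()
            self.calls.append(now)
            return waited


limiter = MinuteRateLimiter(limit=10000)
limiter.acquire()  # call before each API request
```

Holding the lock while sleeping serializes waiting threads, which keeps the sketch simple at the cost of some fairness.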
Configuration via Diskover Admin
Navigate to Settings > Alternate Scanners > Google Drive.
Set the credential file paths and choose your authentication method.
Enable or disable shared drives and Google Apps file inclusion.
Adjust timeout and rate-limit values if needed (defaults are appropriate for most deployments).
Save the configuration.
Configuration Examples
Example 1: Service Account on a Production Server (Recommended)
Diskover:
  Alternate Scanners:
    Google Drive:
      Default:
        credentials_file: ""
        token_file: ""
        service_account_file: service-account.json
        include_shared_drives: true
        include_google_apps_files: true
        request_timeout: 120
        request_limit_per_minute: 10000
        prefetch_workers: 10
Use this when Diskover runs on a headless server and you've shared all target drives/folders with the service account email.
Example 2: OAuth for a Workstation / Single-User Scan
Diskover:
  Alternate Scanners:
    Google Drive:
      Default:
        credentials_file: credentials.json
        token_file: token.json
        service_account_file: ""
        include_shared_drives: false
        include_google_apps_files: true
        request_timeout: 120
        request_limit_per_minute: 10000
        prefetch_workers: 10
Use this for ad-hoc scans from a workstation where a browser is available for first-time authorization.
Example 3: High-Volume Shared Drive Scan, Google Apps Excluded
Diskover:
  Alternate Scanners:
    Google Drive:
      Default:
        credentials_file: ""
        token_file: ""
        service_account_file: service-account.json
        include_shared_drives: true
        include_google_apps_files: false
        request_timeout: 180
        request_limit_per_minute: 8000
        prefetch_workers: 15
Use this for large tenants where you only care about uploaded files (PDFs, images, Office docs) and not Google Apps native documents. Reduced request_limit_per_minute leaves quota headroom for other Google API consumers.
Usage / Execution
The Google Drive scanner runs as a standard alternate scanner via the --altscanner flag with diskover.py.
Basic Usage
Scan the entire personal drive (root path is /gdrive):
# Linux
cd /opt/diskover
python3 diskover.py --altscanner scandir_google /gdrive
# Windows
cd "C:\Program Files\Diskover"
python diskover.py --altscanner scandir_google /gdrive
Scan a Specific Subfolder
python3 diskover.py --altscanner scandir_google /gdrive/Projects/2025
Scan by Google Drive Folder ID
When a folder name is ambiguous or hard to spell, use the gdrive-id:// URI to target it directly by its Drive folder ID:
python3 diskover.py --altscanner scandir_google gdrive-id://1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms
The folder ID is resolved to its name at initialization, and the index path becomes /gdrive/<folder_name>. The folder ID appears in the URL when you open a folder in the Google Drive web UI.
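The two kinds of top path (a /gdrive path and a gdrive-id:// URI) could be told apart as in the sketch below. The `Target` tuple and `parse_top_path` function are hypothetical names used for illustration; the scanner's own parsing may differ.

```python
from typing import NamedTuple, Optional, Tuple

ID_SCHEME = "gdrive-id://"
ROOT = "/gdrive"


class Target(NamedTuple):
    folder_id: Optional[str]    # set only for gdrive-id:// targets
    segments: Tuple[str, ...]   # folder names below /gdrive


def parse_top_path(top_path: str) -> Target:
    """Classify a top path as an ID target or a name-based path."""
    if top_path.startswith(ID_SCHEME):
        folder_id = top_path[len(ID_SCHEME):].strip("/")
        if not folder_id:
            raise ValueError("gdrive-id:// requires a folder ID")
        return Target(folder_id, ())
    if top_path != ROOT and not top_path.startswith(ROOT + "/"):
        raise ValueError(f"unsupported top path: {top_path}")
    segments = tuple(s for s in top_path[len(ROOT):].split("/") if s)
    return Target(None, segments)


print(parse_top_path("/gdrive/Projects/2025").segments)  # ('Projects', '2025')
print(parse_top_path("gdrive-id://1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms").folder_id)
```

A name-based target still has to be resolved segment by segment against the Drive API, whereas an ID target can be fetched directly.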
Include Shared Drives Automatically
With include_shared_drives: true in the configuration, scanning /gdrive automatically discovers and indexes every accessible shared drive as an additional top path (/gdrive/<drive_name>):
python3 diskover.py --altscanner scandir_google /gdrive
Path Format Reference
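Path Format | Description | Example |
|---|---|---|
/gdrive | Scan the entire personal drive (and shared drives if enabled) | /gdrive |
/gdrive/<folder_path> | Scan a specific subfolder by name | /gdrive/Projects/2025 |
gdrive-id://<folder_id> | Scan a specific folder by its Google Drive ID | gdrive-id://1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms |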
Advanced Usage Examples
Custom index name:
python3 diskover.py -i diskover-gdrive-2026 --altscanner scandir_google /gdrive
Verbose/debug logging:
python3 diskover.py --altscanner scandir_google --loglevel DEBUG /gdrive
Reduced thread count (lower API concurrency):
python3 diskover.py --altscanner scandir_google --threads 4 /gdrive
Integration with Index Tasks
The Google Drive scanner can be configured as part of a Diskover Index Task for scheduled/automated scans:
Field | Value |
|---|---|
Alternate Scanner | scandir_google |
Top Path | /gdrive, a subfolder path, or a gdrive-id:// URI |
Custom Index Name | Optional |
Performance Tips
Start with defaults. The built-in rate limiter and prefetcher are tuned for typical workloads; adjust only if you see issues.
Large tenants: Increase prefetch_workers to 15–20 for drives with thousands of subfolders. The background prefetcher fires files.list requests for child directories as soon as they're discovered, reducing wait time when the crawl pool reaches them.
Shared quota environments: If other applications use the same Google Cloud project, reduce request_limit_per_minute to 6,000–8,000 to leave headroom.
Thread tuning: The --threads flag controls Diskover's crawl thread count. For Google Drive, 4–8 threads is usually optimal; higher counts may trigger rate limiting without improving throughput.
Excluding Google Apps files: If you only care about uploaded files, setting include_google_apps_files: false can significantly speed up scans of tenants with large numbers of Docs/Sheets.
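The prefetching behavior described in these tips can be sketched with a thread pool. This is a simplified illustration, not the scanner's code: `DirectoryPrefetcher`, `crawl`, and the `TREE` dictionary (which stands in for files.list responses) are hypothetical.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List


class DirectoryPrefetcher:
    """Fetch child listings in the background so the crawl finds them ready."""

    def __init__(self, list_children: Callable[[str], List[str]], workers: int = 10):
        self._list_children = list_children
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._pending: Dict[str, Future] = {}
        self._lock = threading.Lock()

    def schedule(self, folder_id: str) -> None:
        """Queue a listing for a newly discovered folder (no-op if already queued)."""
        with self._lock:
            if folder_id not in self._pending:
                self._pending[folder_id] = self._pool.submit(self._list_children, folder_id)

    def get(self, folder_id: str) -> List[str]:
        """Return the listing, waiting for a queued fetch or fetching inline."""
        with self._lock:
            future = self._pending.pop(folder_id, None)
        return future.result() if future is not None else self._list_children(folder_id)

    def close(self) -> None:
        self._pool.shutdown(wait=True)


# Stand-in for files.list: folder id -> child folder ids (hypothetical data).
TREE = {"root": ["a", "b"], "a": ["c"], "b": [], "c": []}


def crawl(prefetcher: DirectoryPrefetcher, start: str) -> List[str]:
    visited, stack = [], [start]
    while stack:
        folder = stack.pop()
        visited.append(folder)
        children = prefetcher.get(folder)
        for child in children:
            prefetcher.schedule(child)  # start fetching before the crawl gets there
        stack.extend(children)
    return visited


prefetcher = DirectoryPrefetcher(lambda fid: TREE[fid], workers=4)
print(sorted(crawl(prefetcher, "root")))  # ['a', 'b', 'c', 'root']
prefetcher.close()
```

More workers mean more listings in flight at once, which is why a larger value helps on wide folder trees but also consumes API quota faster.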
Metadata Fields / Elasticsearch Mappings
In addition to standard file metadata (name, path, size, mtime, etc.), the scanner indexes four custom Google Drive-specific fields into Elasticsearch.
Field Mappings
Field Path | ES Type | Description |
|---|---|---|
gdrive_id | keyword | Unique Google Drive file or folder ID |
gdrive_mimetype | keyword | MIME type from the Drive API (e.g., application/pdf) |
gdrive_webviewlink | keyword | Direct URL to open the file in the Google Drive web UI |
gdrive_drivename | keyword | Name of the drive containing the file (e.g., My Drive) |
Elasticsearch Mapping Definition
{
"mappings": {
"properties": {
"gdrive_id": { "type": "keyword" },
"gdrive_mimetype": { "type": "keyword" },
"gdrive_webviewlink": { "type": "keyword" },
"gdrive_drivename": { "type": "keyword" }
}
}
}
Example Indexed Document — Regular File
{
"name": "quarterly_report.pdf",
"path": "/gdrive/Finance/Reports/quarterly_report.pdf",
"path_parent": "/gdrive/Finance/Reports",
"extension": ".pdf",
"size": 2097152,
"size_du": 2097152,
"mtime": "2026-01-15T10:30:00",
"atime": "2026-01-15T10:30:00",
"ctime": "2025-06-01T08:00:00",
"type": "file",
"gdrive_id": "1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms",
"gdrive_mimetype": "application/pdf",
"gdrive_webviewlink": "https://drive.google.com/file/d/1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms/view",
"gdrive_drivename": "My Drive"
}
Example Indexed Document — Google Apps File
{
"name": "Meeting Notes",
"path": "/gdrive/Team/Meeting Notes",
"path_parent": "/gdrive/Team",
"extension": ".gdoc",
"size": 51200,
"size_du": 51200,
"mtime": "2026-03-01T14:00:00",
"atime": "2026-03-01T14:00:00",
"ctime": "2025-10-15T09:00:00",
"type": "file",
"gdrive_id": "abc123def456ghi789",
"gdrive_mimetype": "application/vnd.google-apps.document",
"gdrive_webviewlink": "https://docs.google.com/document/d/abc123def456ghi789/edit",
"gdrive_drivename": "My Drive"
}
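The four custom fields correspond closely to properties of a Drive API v3 file resource (id, mimeType, webViewLink). A minimal sketch of that mapping follows; `gdrive_fields` is a hypothetical helper, and passing the drive name in separately is an assumption about how it would be supplied.

```python
def gdrive_fields(file_resource: dict, drive_name: str) -> dict:
    """Map a Drive API v3 file resource onto the custom index fields."""
    return {
        "gdrive_id": file_resource["id"],
        "gdrive_mimetype": file_resource["mimeType"],
        # webViewLink is only present if it was requested in the `fields` parameter.
        "gdrive_webviewlink": file_resource.get("webViewLink", ""),
        "gdrive_drivename": drive_name,
    }


resource = {
    "id": "abc123def456ghi789",
    "name": "Meeting Notes",
    "mimeType": "application/vnd.google-apps.document",
    "webViewLink": "https://docs.google.com/document/d/abc123def456ghi789/edit",
}
print(gdrive_fields(resource, "My Drive")["gdrive_mimetype"])
```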
Searching in Diskover
Use the Diskover Web UI search bar to query the custom Google Drive fields alongside standard file metadata.
Search Query Examples
Query | Description |
|---|---|
| Find all PDF files in Google Drive |
| Find all Google Docs |
| Find all Google Apps files (Docs, Sheets, Slides, Forms, etc.) |
| Files from personal drives only |
| Files in a specific shared drive |
| Files in any shared drive (wildcard on drive name) |
| Look up a specific file by its Drive ID |
| All files that have a Drive web view link |
Combined Queries
Query | Description |
|---|---|
| Large Google Sheets (> 100 KB estimated) |
| PDFs in personal drives older than a year |
| Google Apps files not modified in 2+ years (archival candidates) |
| Files > 1 GB in any shared drive |
| All images in the Marketing shared drive |
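Queries of this kind can also be assembled programmatically. The sketch below builds strings in Elasticsearch query-string syntax from the field names in the mapping above. Whether the search bar accepts exactly this form is an assumption, and the `term` and `all_of` helpers are hypothetical.

```python
def term(field: str, value: str) -> str:
    """Exact-match clause; quoting keeps slashes and spaces literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


def all_of(*clauses: str) -> str:
    """Join clauses so that every one of them must match."""
    return " AND ".join(clauses)


pdfs = term("gdrive_mimetype", "application/pdf")
large_sheets = all_of(
    term("gdrive_mimetype", "application/vnd.google-apps.spreadsheet"),
    "size:>102400",  # more than 100 KB (estimated)
)
print(pdfs)
print(large_sheets)
```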
Troubleshooting
Issue | Cause | Solution |
|---|---|---|
Authentication fails | Missing/expired credentials or Drive API not enabled | Verify |
Service account can't see drives | Drives not shared with the service account email | Share each target drive or folder with the service account email (Viewer permission is sufficient). |
HTTP 429 rate limit errors | Too many concurrent API requests | Reduce |
"Google Drive folder not found" | Folder name typo or case mismatch | Folder names are case-sensitive. Use |
API timeout errors | Network latency or large listings | Increase |
Folder names with slashes | Slashes ( | Expected behavior — Google Drive allows |
Empty config values cause path errors | Legacy config with empty | The scanner now applies default filenames ( |
Shared drives not indexed |
| Set |
Google Apps files have wrong sizes | Estimated sizes are used | Expected behavior — Google Apps files have no byte size in the API. Set |
Debug Logging
Enable debug-level logging to trace authentication, API calls, and folder resolution:
python3 /opt/diskover/diskover.py --altscanner scandir_google --loglevel DEBUG /gdrive
Log File Locations
Linux:
/var/log/diskover/diskover.log
Windows: Check Diskover service logs or the configured log location in Diskover Admin
Diagnostic Commands
Test authentication directly:
cd /opt/diskover
python3 -c "
import sys
sys.path.insert(0, 'scanners/scandir_google')
from google_scanner.drive_client.auth import GoogleDriveAuth
auth = GoogleDriveAuth(
'scanners/scandir_google/credentials.json',
'scanners/scandir_google/token.json'
)
print('Auth result:', auth.authenticate())
"
Check for rate limit warnings in a recent run:
python3 diskover.py --altscanner scandir_google --loglevel DEBUG /gdrive 2>&1 | grep -i "rate limit"
Test Google API reachability:
curl -s -o /dev/null -w "HTTP %{http_code}\n" "https://www.googleapis.com/drive/v3/about"
Example log output from a successful service account authentication:
2026-05-05 15:16:49,802 - diskover - INFO - Using alternate scanner scandir_google
2026-05-05 15:16:49,911 - google_scanner.drive_client.auth - INFO - Successfully authenticated using service account: /opt/diskover/scanners/scandir_google/service-account.json
2026-05-05 15:16:49,911 - diskover.scandir_google - INFO - Successfully authenticated with Google Drive API
Support
Last Updated: April 24, 2026
Diskover Data, Inc.