OneDrive / SharePoint
License: PRO+ (Professional Edition or higher)
Module Type: Alternate Scanner
Author: Diskover Data, Inc.
Overview / Use Cases
The Diskover OneDrive/SharePoint Scanner brings your Microsoft 365 cloud storage into Diskover, giving you the same powerful search, analysis, and reporting capabilities you rely on for on-premises storage — but for OneDrive user drives and SharePoint site document libraries.
The scanner connects to the Microsoft Graph API to enumerate files and folders across your entire Microsoft 365 tenant (or targeted subsets of users and sites), then indexes all that metadata into Elasticsearch. Once indexed, the data appears in the Diskover Web UI alongside any other indexed storage, letting you search, filter, tag, and report on cloud-stored content just like local files.
Who Is This For?
IT Administrators & Storage Engineers
Cloud storage governance: Get a unified view of storage consumption across all OneDrive user drives and SharePoint sites in your tenant. Identify who's consuming the most storage, find oversized files, and understand growth patterns — without logging into the Microsoft 365 Admin Center and checking accounts one by one.
SharePoint site sprawl management: Microsoft 365 tenants tend to accumulate SharePoint sites over time. Many go dormant or are abandoned entirely. Use this scanner to inventory every site, see which ones contain stale or minimal data, and make informed decisions about decommissioning or consolidating sites.
Data Managers & Compliance Officers
Data migration and archive planning: Planning a migration away from Microsoft 365 or archiving older content? This scanner provides the detailed inventory you need — file counts, sizes, ages, and ownership — so you can scope the effort accurately and prioritize what moves first.
Hybrid cloud storage analysis: If your organization stores data across both on-premises filesystems and Microsoft 365, running this scanner alongside Diskover's filesystem scanners gives you a single view of storage everywhere, enabling consistent reporting and cost analysis.
Sample scan data from a SharePoint Site.
Here we can see the SharePoint site name ID, which can be used to map files back to the SharePoint site they belong to.
Understanding Microsoft Graph API
The OneDrive/SharePoint scanner communicates with Microsoft 365 through the Microsoft Graph API — Microsoft's unified REST API for accessing data across their cloud services. Understanding a few key concepts will help you configure and operate the scanner effectively.
Authentication: OAuth 2.0 Client Credentials
The scanner uses the OAuth 2.0 client credentials flow, which is designed for server-to-server communication without user interaction. Instead of signing in as a specific user, the scanner authenticates as an application registered in your Azure Active Directory (Azure AD) tenant. This means:
No user needs to be logged in for the scanner to run
The scanner accesses data based on application-level permissions, not a specific user's access
A client secret (essentially a password for the application) is used to authenticate
Application Permissions vs. Delegated Permissions
Microsoft Graph supports two permission models:
Permission Type | How It Works | Used By This Scanner? |
|---|---|---|
Application | Grants access to all resources of a given type (e.g., all users' files) without a signed-in user | Yes |
Delegated | Grants access on behalf of a specific signed-in user, limited to what that user can see | No |
The scanner requires Application permissions because it needs to read files across all users and sites in the tenant without requiring anyone to be signed in.
API Throttling
Microsoft enforces rate limits on Graph API requests to protect service stability. If you send too many requests too quickly, the API will respond with 429 Too Many Requests and ask you to slow down. The scanner handles this automatically with built-in rate limiting and exponential backoff retry logic, but you should be aware of it — especially in large tenants with thousands of users. The Configuration section covers how to tune these settings.
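As a rough illustration of the throttling behavior described above, the sketch below classifies a Graph response and reads the server's suggested wait. It assumes the response carries a `Retry-After` header expressed in seconds, which Graph generally includes on throttled responses. The function name `throttle_hint` is hypothetical and this is not the scanner's actual implementation.

```python
def throttle_hint(status, headers):
    """Return (should_retry, server_suggested_wait_in_seconds_or_None)."""
    # Treat throttling (429) and server errors (5xx) as retryable.
    retryable = status == 429 or 500 <= status < 600
    if not retryable:
        return False, None
    # Header lookup is case-sensitive here because a plain dict is assumed.
    value = headers.get("Retry-After")
    try:
        wait = float(value) if value is not None else None
    except ValueError:
        wait = None  # Retry-After may also be an HTTP-date; ignored in this sketch
    return True, wait
```

When no usable hint is returned, a caller would fall back to its own backoff policy.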
OneDrive vs. SharePoint: What Gets Scanned
Scan Mode | Command | What It Covers |
|---|---|---|
Users | od:///users | Each user's personal OneDrive drive: the "My Files" content tied to their Microsoft 365 account |
Sites | od:///sharepoint | SharePoint site document libraries: shared team content, project sites, communication sites |
You can run both scan modes to get a complete picture, indexing them into separate Elasticsearch indices or combining them as needed.
Microsoft Graph OData Filters
The user_filter and site_filter configuration fields in Diskover Admin support Graph API OData filter syntax, which lets you target specific subsets of users or sites without scanning the entire tenant. Some common examples:
Filter | What It Does |
|---|---|
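startswith(userPrincipalName,'j') | Users whose email starts with "j" |
department eq 'Engineering' | Users in the Engineering department |
webUrl eq 'https://contoso.sharepoint.com/sites/ProjectAlpha' | A specific SharePoint site by URL |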
These filters are passed directly to the Microsoft Graph API, so any valid OData filter syntax that Graph supports will work.
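To show what "passed directly to the Microsoft Graph API" can look like on the wire, here is a minimal sketch that attaches an OData filter to the Graph v1.0 `users` endpoint. The helper name `users_url` is hypothetical, and how the scanner builds its requests internally is an assumption; only the endpoint and the `$filter` query parameter are standard Graph features.

```python
from urllib.parse import quote

GRAPH_BASE = "https://graph.microsoft.com/v1.0"

def users_url(user_filter=""):
    """Build a Graph 'list users' URL, optionally with an OData $filter."""
    url = f"{GRAPH_BASE}/users"
    if user_filter:
        # Percent-encode the expression; the '$filter' key itself stays literal.
        url += "?$filter=" + quote(user_filter, safe="")
    return url

print(users_url("department eq 'Engineering'"))
```

Spaces and quotes in the expression are percent-encoded, so the filter can be typed in plain OData form in Diskover Admin.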
Requirements
System Requirements
Component | Requirement |
|---|---|
Python | 3.11+ (included with Diskover installation) |
Diskover | Core installation with alternate scanner support and Elasticsearch cluster |
Network | HTTPS access to login.microsoftonline.com and graph.microsoft.com |
Operating System | Linux or Windows |
Python Dependencies
Package | Version | Purpose |
|---|---|---|
| 3.9+ | Async HTTP client for Microsoft Graph API requests |
| 2.9+ | Date/time parsing for file timestamps |
Microsoft Azure Requirements
Requirement | Description |
|---|---|
Azure AD Tenant | An active Microsoft 365 tenant with Azure Active Directory |
App Registration | An Azure AD application registered with client credentials |
API Permissions | Files.Read.All, User.Read.All, Sites.Read.All |
Client Credentials | Application (client) ID and a client secret |
Important: All three API permissions must be granted as Application permissions (not Delegated), and an Azure AD administrator must click "Grant admin consent" for them to take effect.
Installation
Step 1: Install the Scanner Package
Linux (RPM-based):
dnf install diskover-scanner-onedrive
Windows:
The scanner files are included with the Diskover Windows installation. No separate installation step is required.
Install Locations:
Platform | Path |
|---|---|
Linux | /opt/diskover/scanners/scandir_onedrive |
Windows | C:\Program Files\Diskover\scanners\scandir_onedrive |
Step 2: Install Python Dependencies
Linux:
cd /opt/diskover/scanners/scandir_onedrive
python3 -m pip install -r requirements.txt
Windows:
cd "C:\Program Files\Diskover\scanners\scandir_onedrive"
python -m pip install -r requirements.txt
Step 3: Create an Azure AD App Registration
This step registers an application identity in your Azure AD tenant that the scanner will use to authenticate.
Sign in to the Azure Portal
Navigate to Azure Active Directory → App registrations → New registration
Fill in the registration form:
Name: Diskover OneDrive Scanner (or your preferred name)
Supported account types: "Accounts in this organizational directory only"
Redirect URI: Leave blank (not needed for client credentials flow)
Click Register
Step 4: Configure API Permissions
In your new app registration, go to API permissions → Add a permission
Select Microsoft Graph → Application permissions
Search for and add each of the following:
Files.Read.All: Read all files in all site collections
User.Read.All: Read all users' full profiles
Sites.Read.All: Read items in all site collections
Click Grant admin consent for [Your Tenant]
Verify that all three permissions display a green checkmark with "Granted" status
Note: Admin consent is required because these are Application-level permissions that access data across the entire tenant. A Global Administrator or Privileged Role Administrator must grant this consent.
Step 5: Create a Client Secret
In your app registration, go to Certificates & secrets → Client secrets → New client secret
Add a description (e.g., "Diskover Scanner") and choose an expiration period
Click Add
Immediately copy the secret value — it will not be displayed again after you leave this page
Tip: Set a calendar reminder before the secret expires. An expired secret will cause authentication failures that can be confusing to troubleshoot. You can always create a new secret and update your scanner configuration before the old one expires.
Step 6: Gather Your Azure Credentials
You'll need four values from your Azure portal to configure the scanner:
Value | Where to Find It |
|---|---|
Tenant ID | Azure AD → App registrations → [Your App] → Overview |
Client ID (Application ID) | Azure AD → App registrations → [Your App] → Overview |
Client Secret | The value you copied in Step 5 |
Tenant Name | Your Microsoft 365 domain (e.g., |
Step 7: Configure Scanner Settings
Scanner settings are configured in Diskover Admin under Settings → Alternate Scanners → OneDrive.
Enter your Azure connection details in the Diskover Admin UI:
Step 8: Verify Connectivity
You can verify that your app registration credentials work by testing the Microsoft Graph API directly. From the Diskover server, run:
curl -X POST \
"https://login.microsoftonline.com/{your-tenant-id}/oauth2/v2.0/token" \
-H "Content-Type: application/x-www-form-urlencoded" \
-d "client_id={your-client-id}&scope=https://graph.microsoft.com/.default&client_secret={your-client-secret}&grant_type=client_credentials"
A successful response returns a JSON object with an access_token. If you receive an error, the issue is with your Azure credentials — not the scanner.
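If curl is not available on the Diskover server, the same check can be sketched with the Python standard library. This is a minimal equivalent of the curl command above, not part of the scanner; the function names `build_token_request` and `fetch_token` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def build_token_request(tenant_id, client_id, client_secret):
    """Build (but do not send) the client-credentials token request."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "client_id": client_id,
        "scope": "https://graph.microsoft.com/.default",
        "client_secret": client_secret,
        "grant_type": "client_credentials",
    }).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(url, data=body, headers=headers, method="POST")

def fetch_token(tenant_id, client_id, client_secret):
    """Send the request and return the access_token string from the JSON reply."""
    request = build_token_request(tenant_id, client_id, client_secret)
    with urlopen(request, timeout=30) as response:
        return json.load(response)["access_token"]
```

Calling `fetch_token(...)` with your three values should return a long token string. An `HTTPError` raised by `urlopen` points to a credential or consent problem rather than a scanner problem.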
Configuration
Configuration Approach
All configuration for the OneDrive/SharePoint scanner is managed through Diskover Admin under Settings → Alternate Scanners → OneDrive. There are no configuration files to edit on the server.
Here is the beginning of the default configuration. The OneDrive/SharePoint scanner has many other settings, all covered in detail below.
Azure Configuration Parameters
Parameter | Type | Default | Description |
|---|---|---|---|
| string | Required | Azure AD tenant ID (GUID format) |
| string | Required | Azure AD application (client) ID |
| string | Required | Azure AD client secret value |
| string | (empty) | Your Microsoft 365 tenant domain name (e.g., |
| string |
| OAuth 2.0 scopes for Graph API access. Normally leave as default. |
Scan Target Parameters
Parameter | Type | Default | Description |
|---|---|---|---|
| select |
| What to scan: |
| string | (empty) | Optional Graph API |
| list | (empty) | Specific |
| string | (empty) | Optional Graph API |
| list | (empty) | Specific site names to scan. If set, overrides |
Rate Limiting Parameters
These settings control how aggressively the scanner calls the Microsoft Graph API. The defaults work well for small-to-medium tenants. For large tenants or environments where you encounter throttling, reduce these values.
Parameter | Type | Default | Description |
|---|---|---|---|
| integer |
| Maximum Graph API requests per minute |
| integer |
| OAuth token refresh interval in seconds (default 45 min, tokens expire at 60 min) |
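To make the per-minute limit concrete, the sketch below spaces requests evenly so that no more than the configured number go out per minute. It is one possible approach, shown under the assumption that the limit is enforced by pacing; the scanner's real limiter may work differently, and the class name `MinuteRateLimiter` is hypothetical.

```python
import time

class MinuteRateLimiter:
    """Space requests evenly so at most `limit_per_minute` are sent per minute."""

    def __init__(self, limit_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / limit_per_minute  # seconds between requests
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = clock()

    def acquire(self):
        """Block until the next request slot is available."""
        now = self.clock()
        if now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        # Reserve the following slot one interval later.
        self.next_allowed = now + self.interval
```

With a limit of 600 requests per minute, each call to `acquire()` is spaced about 0.1 seconds apart. The clock and sleep functions are injectable so the pacing can be checked without real delays.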
Prefetch Parameters
The scanner uses a background prefetch worker to concurrently fetch directory listings from the Microsoft Graph API, providing significant speedup for large filesystems.
Parameter | Type | Default | Description |
|---|---|---|---|
| integer |
| Maximum parallel prefetch requests to Graph API. Higher values speed up scanning but use more connections. |
| integer |
| Maximum directory listings to cache in memory. Controls memory usage on large filesystems. |
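The two prefetch settings map naturally onto a concurrency cap and a bounded cache. The sketch below illustrates that idea with `asyncio`; it is an assumption about the general technique, not the scanner's code, and the names `prefetch`, `fetch_listing`, `max_concurrent`, and `max_cached` are all hypothetical.

```python
import asyncio
from collections import OrderedDict

async def prefetch(paths, fetch_listing, max_concurrent=4, max_cached=100):
    """Fetch listings concurrently, keeping at most max_cached results in memory."""
    gate = asyncio.Semaphore(max_concurrent)  # caps in-flight requests
    cache = OrderedDict()                     # insertion-ordered, oldest first

    async def worker(path):
        async with gate:
            listing = await fetch_listing(path)
        cache[path] = listing
        while len(cache) > max_cached:
            cache.popitem(last=False)         # evict the oldest entry

    await asyncio.gather(*(worker(p) for p in paths))
    return cache
```

Raising the concurrency cap trades more open connections for speed, while the cache bound limits how many directory listings are held in memory at once.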
Graph API Retry Parameters
When the Microsoft Graph API returns errors (429 throttling, 5xx server errors), the scanner retries with exponential backoff. These parameters control that behavior.
Parameter | Type | Default | Description |
|---|---|---|---|
| integer |
| Maximum retry attempts for a failed Graph API request |
| integer |
| Initial wait time in seconds before the first retry |
| integer |
| Maximum wait time in seconds between retries (1 hour ceiling) |
| integer |
| Multiplier for exponential backoff between retries |
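To see how these four values interact, the sketch below computes the wait before each retry. It assumes the common formula of initial wait multiplied by the factor raised to the attempt number, capped at the maximum wait; the scanner's exact formula is not documented here, and `max_wait` is a hypothetical name for the maximum-wait setting.

```python
def backoff_schedule(max_retries, base_wait, backoff_factor, max_wait):
    """Return the wait in seconds before each retry attempt, capped at max_wait."""
    return [min(base_wait * backoff_factor ** n, max_wait) for n in range(max_retries)]

# base_wait=30 and backoff_factor=10 with a one-hour (3600 s) ceiling
print(backoff_schedule(5, 30, 10, 3600))  # prints [30, 300, 3000, 3600, 3600]
```

With a large multiplier the ceiling is reached after only a few attempts, so most of a long retry sequence is spent waiting the maximum time.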
Configuration Examples
Standard Configuration (Small Tenant, Under ~100 Users)
The default settings work well for most small tenants. No changes needed beyond filling in your Azure credentials and selecting the appropriate scan_mode.
Conservative Configuration (Large Tenant, 1,000+ Users)
For large tenants, reduce the rate limit to avoid triggering Microsoft Graph throttling. The scan will take longer, but it will complete reliably without repeated 429 errors.
Set request_limit_per_minute to 600, max_retries to 30, base_wait to 30, and backoff_factor to 10.
Usage / Execution
The OneDrive/SharePoint scanner uses the standard --altscanner flag with diskover.py.
Scanning OneDrive User Drives
Scan All Users
Set scan_mode to Users in Diskover Admin, then run:
# Linux
cd /opt/diskover
python3 diskover.py -i <INDEX-NAME> --altscanner scandir_onedrive od:///users
# Windows
cd "C:\Program Files\Diskover"
python diskover.py -i <INDEX-NAME> --altscanner scandir_onedrive od:///users
Scan Specific Users
To target one or more users by their UserPrincipalName (email address), add them to the user_list field in Diskover Admin, then run the scan as above.
Filter Users with a Graph API Query
Use OData filter syntax to select users by attribute. Enter the filter in the user_filter field in Diskover Admin:
Users whose email starts with "s":
startswith(userPrincipalName,'s')
Users in the Engineering department:
department eq 'Engineering'
Note: You cannot combine user_list (specific users) and user_filter (query filter). If user_list is populated, it takes precedence.
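The precedence rule in the note can be expressed as a small decision function. This is an illustrative sketch only; the function name `resolve_user_selection` and its return values are hypothetical, not the scanner's actual logic.

```python
def resolve_user_selection(user_list, user_filter):
    """Return ('list', users), ('filter', expression), or ('all', None)."""
    users = [u.strip() for u in user_list if u.strip()]
    if users:
        return "list", users  # explicit users win over any filter
    if user_filter.strip():
        return "filter", user_filter.strip()
    return "all", None        # nothing set: scan every user
```

For example, a populated `user_list` is used even when `user_filter` also contains an expression.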
Scanning SharePoint Sites
Scan All Sites
Set scan_mode to Sites in Diskover Admin, then run:
# Linux
cd /opt/diskover
python3 diskover.py -i <INDEX-NAME> --altscanner scandir_onedrive od:///sharepoint
# Windows
cd "C:\Program Files\Diskover"
python diskover.py -i <INDEX-NAME> --altscanner scandir_onedrive od:///sharepoint
Scan Specific Sites by Name
Add site names to the site_list field in Diskover Admin (e.g., "Marketing", "Engineering"), then run the scan as above.
Filter Sites with a Graph API Query
Enter a filter expression in the site_filter field in Diskover Admin:
webUrl eq 'https://contoso.sharepoint.com/sites/ProjectAlpha'
Path Format Reference
Path Format | Description | Example |
|---|---|---|
| Virtual root for all OneDrive user drives | Files appear as |
| Virtual root for all SharePoint sites | Files appear as |
Advanced Usage Examples
Custom Index Name
python3 diskover.py --altscanner scandir_onedrive -i <INDEX-NAME> od:///users
Debug Logging
python3 diskover.py --altscanner scandir_onedrive -i <INDEX-NAME> --debug od:///users
Integration with Index Tasks
The OneDrive/SharePoint scanner can be configured as part of a Diskover Index Task for scheduled or recurring scans.
Field | Value |
|---|---|
Scanner |
|
Crawl Directory | Path will depend on your setup |
Sample OneDrive storage scan task configuration:
Here we can see a basic setup for a OneDrive scan. The setup is very similar for a scan of a SharePoint site.
Performance Tips
Start with a small scan to validate your configuration: scan a single user or site (using user_list or site_list) before running a full tenant scan.
Use --debug for production scans. The overhead is minimal, and the additional logging is invaluable if something goes wrong partway through a long scan.
Schedule scans during off-hours to minimize impact on Graph API quotas shared with other applications in your tenant.
Metadata Fields / Elasticsearch Mappings
The scanner indexes files and directories with standard Diskover document fields, plus one scanner-specific field.
Custom Field
Field Path | ES Type | Description |
|---|---|---|
| keyword | The Microsoft Graph item ID for the file or folder. Uniquely identifies the item in OneDrive/SharePoint. |
Standard Fields (Files)
Field Path | ES Type | Description |
|---|---|---|
| keyword | File name |
| keyword | Lowercase file extension (e.g., |
| keyword | Full virtual path to the parent directory |
| long | File size in bytes |
| long | Disk usage size in bytes (same as |
| keyword | Display name of the file creator |
| keyword | Always |
| date | Last modified timestamp |
| date | Last accessed timestamp (set to |
| date | Creation timestamp |
| keyword | Microsoft Graph API item ID (unique identifier) |
| keyword | Always |
Standard Fields (Directories)
Field Path | ES Type | Description |
|---|---|---|
| keyword | Directory name |
| keyword | Full virtual path to the parent directory |
| long | Total recursive size of all descendant files |
| long | Size of immediate child files only |
| long | Total recursive file count |
| long | Immediate child file count |
| long | Total recursive subdirectory count |
| long | Immediate child subdirectory count |
| integer | Depth relative to user/site root |
| keyword | Always |
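The difference between the recursive and immediate-child values in the table can be illustrated with a small roll-up over file records. This is a sketch of the general idea, not how Diskover computes these fields; the function name and the dictionary keys (`recursive_size`, `direct_size`, `recursive_files`, `direct_files`) are hypothetical labels, not the actual field names.

```python
from collections import defaultdict
from posixpath import dirname

def directory_rollup(files):
    """files: iterable of (parent_path, size). Returns per-directory totals."""
    totals = defaultdict(lambda: {"recursive_size": 0, "direct_size": 0,
                                  "recursive_files": 0, "direct_files": 0})
    for parent_path, size in files:
        # Immediate-child counters apply only to the file's own directory.
        totals[parent_path]["direct_size"] += size
        totals[parent_path]["direct_files"] += 1
        # Recursive counters apply to that directory and every ancestor.
        current = parent_path
        while True:
            totals[current]["recursive_size"] += size
            totals[current]["recursive_files"] += 1
            if current in ("/", ""):
                break
            current = dirname(current)
    return dict(totals)
```

A file in `/a/b` therefore adds to the recursive totals of `/a/b`, `/a`, and `/`, but to the immediate-child totals of `/a/b` only.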
Example Indexed Document (File)
{
"name": "Q4-Report.xlsx",
"extension": ".xlsx",
"parent_path": "/onedrive/users/jdoe/Documents/Reports",
"size": 245760,
"size_du": 245760,
"owner": "John Doe",
"group": "0",
"mtime": "2025-12-15T14:30:00",
"atime": "2025-12-15T14:30:00",
"ctime": "2025-11-01T09:00:00",
"nlink": 0,
"ino": "01ABCDEF12345678",
"type": "file"
}
Example Indexed Document (File from a SharePoint Site Scan)
{
"name": "Brand-Guidelines.pdf",
"extension": ".pdf",
"parent_path": "/sharepoint/sites/Marketing/Shared Documents/Brand",
"size": 5242880,
"size_du": 5242880,
"owner": "Jane Smith",
"group": "0",
"mtime": "2025-09-20T11:15:00",
"atime": "2025-09-20T11:15:00",
"ctime": "2025-06-10T08:00:00",
"nlink": 0,
"ino": "01ZYXWVU87654321",
"onedrive_item_id": "Marketing",
"type": "file"
}
Searching in Diskover
Once the scanner has indexed your OneDrive and SharePoint data, you can search for it in the Diskover Web UI using standard search syntax. Here are some practical query examples.
Search by SharePoint Site Name
The onedrive_item_id field is available on all documents from SharePoint site scans, making it easy to scope searches to a specific site.
Query | Description |
|---|---|
| All files and folders in the "Marketing" SharePoint site |
| Only PowerPoint files in the Marketing site |
| All folders in the Engineering site |
Find Large Files
Query | Description |
|---|---|
| Files larger than 100 MB |
| Files larger than 1 GB |
| OneDrive files larger than 50 MB |
| Marketing site files larger than 10 MB |
Find Old or Stale Files
Query | Description |
|---|---|
| Files not modified since before 2024 |
| Files not modified in over 2 years |
| Stale SharePoint files across all sites |
| Files created and last modified before 2023 |
Combined Queries
Query | Description |
|---|---|
| Large video files in the Marketing site |
| All Excel files in John Doe's OneDrive |
| Old temporary files (cleanup candidates) |
| Empty directories |
Troubleshooting
Common Issues
Issue | Cause | Solution |
|---|---|---|
Scanner fails immediately with "Authentication failed" or 401 error | Incorrect Azure credentials | Verify |
"Access denied" or 403 errors when accessing users or sites | Missing or unapproved API permissions | Confirm all three permissions ( |
Scanner slows down significantly or logs show 429 errors | Microsoft Graph API throttling | Reduce |
"Drive not found" for specific users | User does not have a OneDrive license or their OneDrive is not provisioned | This is expected for unlicensed users. Verify in the Microsoft 365 Admin Center that the user has a license that includes OneDrive. |
Specific users or sites missing from results | Incorrect UPN or site name, or personal sites being filtered | Verify the exact |
| Path doesn't start with | Ensure the top path argument uses the |
Debug Logging
For troubleshooting, enable debug logging to get detailed operational output:
# Linux
python3 /opt/diskover/diskover.py --altscanner scandir_onedrive --debug od:///users
# Windows
python "C:\Program Files\Diskover\diskover.py" --altscanner scandir_onedrive --debug od:///users
Log File Locations
Linux: /var/log/diskover/diskover.log
Windows: Check Diskover service logs or the configured log location
Verifying Azure Authentication
You can test your Azure credentials independently using curl:
curl -X POST \
"https://login.microsoftonline.com/{your-tenant-id}/oauth2/v2.0/token" \
-H "Content-Type: application/x-www-form-urlencoded" \
-d "client_id={your-client-id}&scope=https://graph.microsoft.com/.default&client_secret={your-client-secret}&grant_type=client_credentials"
A successful response returns a JSON object with an access_token. If you receive an error, the issue is with your Azure credentials — not the scanner.
Migration Note: From settings.toml to Diskover Admin
If you previously configured the OneDrive/SharePoint scanner using settings.toml or .secrets.toml files on the server, those files are no longer used. All configuration has moved to Diskover Admin under Settings → Alternate Scanners → OneDrive. You'll need to re-enter your Azure AD credentials and scan settings in the Admin panel.
Support
Last Updated: April 2026