* initial telemetry manager impl
* adding todo
* updates
* updates
* progress on telemetry: config load, process, model load, train start / end, error tracking
* update error file path sanitization function; adding more error tracking
* updated sanitization logic, tests
* adding runtime metrics (cpu + gpu memory, steps/s, etc.)
* tests for runtime metrics telemetry and assoc. callback
* small update / fix
* simplifying path redaction
* sleep on all ranks in distributed setting
* adding back in base_model redaction w/ whitelist
* fix
* doc update
* improved redaction, send system info during model config load telemetry, etc.
* adding runtime metrics / system info additional accelerator support, etc.
* adding runtime metrics / system info additional accelerator support, etc.
* remove duplicate info
* fixes
* fix issue with tests in ci
* distributed fix
* opt-in version of telemetry
* enable / disable logic update
* docs fix
* doc update
* minor fixes
* simplifying
* slight changes
* fix
* lint
* update posthog dep
* coderabbit comments
* fix: opt-in model
* fix: increase time since last
* fix: increase whitelist orgs
* fix: posthog init and shutdown
* fix: imports
* fix: also check grad norm
* fix: duplicate plugin_manager calls
* fix: bad merge
* chore: update docs
* fix: cache process per comment
* fix: error handling
* fix: tests
* Revert "fix: error handling"
This reverts commit 22d1ea5755.
* fix: test telemetry error_handled bool
* fix: revert test
* chore: final doc fixes
---------
Co-authored-by: Dan Saunders <danjsaund@gmail.com>
Co-authored-by: Dan Saunders <dan@axolotl.ai>
62 lines
2.4 KiB
Plaintext
---
title: Telemetry
description: A description of the telemetry implementation in Axolotl.
---

# Telemetry in Axolotl

Axolotl implements anonymous telemetry to help maintainers understand how the library
is used and where users encounter issues. This data helps prioritize features, optimize
performance, and fix bugs.

## Data Collection

We collect:

- System info: OS, Python version, Axolotl version, PyTorch version, Transformers
  version, etc.
- Hardware info: CPU count, memory, GPU count and models
- Runtime metrics: Training progress, memory usage, timing information
- Usage patterns: Models (from a whitelist) and configurations used
- Error tracking: Stack traces and error messages (sanitized to remove personal
  information)

Personally identifiable information (PII) is not collected.

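To illustrate what sanitizing a stack trace can look like, here is a minimal sketch that keeps only the file name from each traceback entry. The function name `sanitize_stack_trace` and the regular expression are assumptions made for this example, not Axolotl's actual sanitization code, and the sketch does not attempt to scrub error message text.

```python
import re

# Hypothetical pattern: quoted file paths as they appear in Python tracebacks.
_TRACEBACK_FILE = re.compile(r'File "([^"]+)"')


def sanitize_stack_trace(trace: str) -> str:
    """Replace each traceback file path with its final path component."""

    def _strip_dirs(match: re.Match) -> str:
        # Normalize Windows separators, then drop every directory component.
        path = match.group(1).replace("\\", "/")
        return f'File "{path.rsplit("/", 1)[-1]}"'

    return _TRACEBACK_FILE.sub(_strip_dirs, trace)
```

With this sketch, an entry such as `File "/home/alice/project/train.py", line 3, in main` would be reported as `File "train.py", line 3, in main`, so the user name and directory layout are dropped.
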
## Implementation

Telemetry is implemented using PostHog and consists of:

- `axolotl.telemetry.TelemetryManager`: A singleton class that initializes the
  telemetry system and provides methods for tracking events.
- `axolotl.telemetry.errors.send_errors`: A decorator that captures exceptions and
  sends sanitized stack traces.
- `axolotl.telemetry.runtime_metrics.RuntimeMetricsTracker`: A class that tracks
  runtime metrics during training.
- `axolotl.telemetry.callbacks.TelemetryCallback`: A Trainer callback that sends
  runtime metrics telemetry.

The telemetry system will block training startup for 10 seconds to ensure users are
aware of data collection, unless telemetry is explicitly enabled or disabled.

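The singleton manager and the error-reporting decorator described above can be sketched roughly as follows. `SketchTelemetryManager`, `send_event`, and `send_errors_sketch` are hypothetical names for this illustration. Events are stored in a list rather than sent to PostHog, and the stack trace is recorded unsanitized, so this shows the pattern only, not Axolotl's implementation.

```python
import functools
import traceback


class SketchTelemetryManager:
    """Hypothetical stand-in for a singleton telemetry manager."""

    _instance = None

    def __new__(cls):
        # Reuse one shared instance so every caller sees the same state.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.enabled = True
            cls._instance.events = []
        return cls._instance

    def send_event(self, event_type, properties=None):
        # A real manager would forward the event to an analytics backend here.
        if self.enabled:
            self.events.append((event_type, properties or {}))


def send_errors_sketch(func):
    """Report exceptions raised by `func`, then re-raise them unchanged."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            SketchTelemetryManager().send_event(
                f"{func.__name__}-error",
                {
                    "exception": type(exc).__name__,
                    # A real implementation would sanitize this before sending.
                    "stack_trace": traceback.format_exc(),
                },
            )
            raise

    return wrapper
```

Re-raising inside the decorator keeps telemetry from changing program behavior: the caller sees the same exception whether or not reporting is enabled.
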
## Opt-Out Mechanism

Telemetry is **enabled by default** on an opt-out basis. To disable it, set
`AXOLOTL_DO_NOT_TRACK=1` or `DO_NOT_TRACK=1`.

A warning message is logged on startup to clearly inform users about telemetry.
We plan to remove this warning after some period.

To hide the telemetry warning that is displayed when training (or another command)
starts, explicitly set `AXOLOTL_DO_NOT_TRACK=0` (enable telemetry) or
`AXOLOTL_DO_NOT_TRACK=1` (disable telemetry).

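One way to read the rules above is sketched below. The helper names `telemetry_enabled` and `should_show_warning`, the set of accepted truthy values, and the choice to let either variable disable telemetry are assumptions of this sketch. Only the two environment variable names come from this page.

```python
import os

# Assumption: which values count as "do not track"; the docs only show "1".
_TRUTHY = {"1", "true", "yes"}


def telemetry_enabled(env=None) -> bool:
    """Return False if either opt-out variable is set to a truthy value."""
    env = os.environ if env is None else env
    for name in ("AXOLOTL_DO_NOT_TRACK", "DO_NOT_TRACK"):
        if env.get(name, "").strip().lower() in _TRUTHY:
            return False
    return True


def should_show_warning(env=None) -> bool:
    """Warn only when telemetry is on and the user made no explicit choice."""
    env = os.environ if env is None else env
    return telemetry_enabled(env) and "AXOLOTL_DO_NOT_TRACK" not in env
```

Passing a plain dictionary as `env` makes the logic easy to check without touching the real environment: `telemetry_enabled({"DO_NOT_TRACK": "1"})` is false, while `should_show_warning({"AXOLOTL_DO_NOT_TRACK": "0"})` is false because the user has explicitly opted in.
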
## Privacy

- All path-like config information is automatically redacted from telemetry data
- Model information is only collected for whitelisted organizations
    - See `axolotl/telemetry/whitelist.yaml` for the set of whitelisted organizations
- Each run generates a unique anonymous ID
    - This allows us to link different telemetry events from the same training run
- Telemetry is only sent from the main process to avoid duplicate events
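
The redaction and run ID points above can be illustrated with a small sketch. The whitelist contents, the `[REDACTED]` placeholder, the path heuristic, and the function names are all assumptions for this example. The real whitelist lives in `axolotl/telemetry/whitelist.yaml`, and the actual redaction rules may differ.

```python
import uuid

# Hypothetical whitelist; the real set is in axolotl/telemetry/whitelist.yaml.
WHITELISTED_ORGS = {"example-org"}
REDACTED = "[REDACTED]"


def looks_like_path(value) -> bool:
    """Crude heuristic: any string containing a path separator."""
    return isinstance(value, str) and ("/" in value or "\\" in value)


def redact_config(config: dict) -> dict:
    """Return a copy of `config` that is safer to include in telemetry."""
    redacted = {}
    for key, value in config.items():
        if key == "base_model":
            # Keep the model name only if its organization prefix is whitelisted.
            org = str(value).split("/", 1)[0].lower()
            redacted[key] = value if org in WHITELISTED_ORGS else REDACTED
        elif isinstance(value, dict):
            redacted[key] = redact_config(value)
        elif looks_like_path(value):
            redacted[key] = REDACTED
        else:
            redacted[key] = value
    return redacted


def new_run_id() -> str:
    """Random identifier used to group events from one run."""
    return str(uuid.uuid4())
```

Because the run ID is a random UUID rather than something derived from the machine or user, it can link events within a run without identifying who produced them.
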