Operations
This section covers how to run your implementation in a controlled way after deployment.
Operational pillars
- Data quality checks.
- Execution monitoring.
- Failure triage.
- Replay and recovery.
- Environment-safe troubleshooting.
Data quality
Data quality is enforced both by contract-aware validation and by the way Silver quarantine behavior is reported.
- Track failure reasons, not just failure counts.
- Distinguish contract authoring failures from row-level transform failures.
- Treat invalid `transformLogic` as a pipeline-authoring problem and mixed row failures as quarantine candidates.
- Separate casting failures from constraint failures.
- Make quarantine tables visible to operators, not just developers.
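The distinctions above can be sketched as a small classifier over failure records. This is a minimal illustration, not the project's implementation: the `FailureKind` categories, the `classify` and `summarize_failures` helpers, and the record shape (`stage`, `reason`) are hypothetical names chosen for the example.

```python
from collections import Counter
from enum import Enum


class FailureKind(Enum):
    AUTHORING = "authoring"    # invalid contract or transformLogic: fix the pipeline
    CASTING = "casting"        # a value could not be cast to the declared type
    CONSTRAINT = "constraint"  # a value was cast but violated a declared rule


def classify(failure: dict) -> FailureKind:
    """Map one failure record to a kind (assumed keys: 'stage', 'reason')."""
    if failure["stage"] == "contract":
        return FailureKind.AUTHORING
    if failure["stage"] == "cast":
        return FailureKind.CASTING
    return FailureKind.CONSTRAINT


def summarize_failures(failures: list) -> dict:
    """Count failure reasons per kind, so operators see why rows failed."""
    summary = {kind.value: Counter() for kind in FailureKind}
    for failure in failures:
        summary[classify(failure).value][failure["reason"]] += 1
    return summary


failures = [
    {"stage": "cast", "reason": "amount: not a decimal"},
    {"stage": "cast", "reason": "amount: not a decimal"},
    {"stage": "constraint", "reason": "quantity: must be >= 0"},
    {"stage": "contract", "reason": "transformLogic: parse error"},
]
summary = summarize_failures(failures)
```

Keeping the reason string alongside the count is what lets an operator tell a single bad source column from a broken contract without reading pipeline code.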
Monitoring
- Monitor pipeline completion, duration, input volume, and quarantine rate.
- Keep operational logging attached to the data plane, not buried in a separate dashboard with no context.
- Prefer a small set of trusted signals over many unactionable metrics.
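As a rough sketch of what a small set of trusted signals might look like, the example below evaluates the four signals named above for a single run. The `RunMetrics` shape, the `evaluate_run` helper, and both threshold defaults are assumptions made for illustration; real limits depend on the workload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
    """Signals for one pipeline run (hypothetical shape)."""
    completed: bool
    duration_seconds: float
    input_rows: int
    quarantined_rows: int

    @property
    def quarantine_rate(self) -> float:
        # An empty input has no meaningful rate; report 0.0 and alert on volume instead.
        if self.input_rows == 0:
            return 0.0
        return self.quarantined_rows / self.input_rows


def evaluate_run(metrics: RunMetrics,
                 max_duration_seconds: float = 1800.0,   # assumed threshold
                 max_quarantine_rate: float = 0.05) -> list:  # assumed threshold
    """Return human-readable alerts; an empty list means the run looks healthy."""
    alerts = []
    if not metrics.completed:
        alerts.append("pipeline did not complete")
    if metrics.duration_seconds > max_duration_seconds:
        alerts.append(f"duration {metrics.duration_seconds:.0f}s exceeds "
                      f"{max_duration_seconds:.0f}s")
    if metrics.input_rows == 0:
        alerts.append("no input rows received")
    if metrics.quarantine_rate > max_quarantine_rate:
        alerts.append(f"quarantine rate {metrics.quarantine_rate:.1%} exceeds "
                      f"{max_quarantine_rate:.1%}")
    return alerts


alerts = evaluate_run(RunMetrics(completed=True, duration_seconds=120.0,
                                 input_rows=1000, quarantined_rows=80))
```

Each alert maps to one action an operator can take, which keeps the signal set small and actionable.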
Common recovery runbooks
Key Vault soft-delete conflict
If redeployment fails because a vault already exists in deleted state, check whether the environment expects recovery or purge-safe behavior before forcing a new name.
- List deleted vaults first.
- Recover the vault when purge protection is enabled, because a purge-protected vault cannot be purged until its retention period ends.
- Treat the recovery path as part of the environment lifecycle, not a one-off operator trick.
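The decision above can be sketched as a function that picks an Azure CLI command for one soft-deleted vault. It assumes the vault record comes from `az keyvault list-deleted` and carries a `properties.purgeProtectionEnabled` field; `plan_vault_action` and its `environment_expects_recovery` flag are hypothetical names. The function only builds the command and does not run it.

```python
def plan_vault_action(deleted_vault: dict, environment_expects_recovery: bool) -> list:
    """Choose 'recover' or 'purge' for one soft-deleted vault.

    deleted_vault is assumed to be one entry from `az keyvault list-deleted`.
    """
    name = deleted_vault["name"]
    properties = deleted_vault.get("properties", {})

    if properties.get("purgeProtectionEnabled"):
        # Purge is blocked until the retention period ends, so recovery
        # is the only way to reuse the name now.
        return ["az", "keyvault", "recover", "--name", name]

    if environment_expects_recovery:
        # The environment lifecycle prefers keeping the existing vault.
        return ["az", "keyvault", "recover", "--name", name]

    # No purge protection and the environment allows a clean slate.
    return ["az", "keyvault", "purge", "--name", name]


protected = {"name": "kv-demo", "properties": {"purgeProtectionEnabled": True}}
command = plan_vault_action(protected, environment_expects_recovery=False)
```

Returning the command instead of executing it leaves room for review or logging before anything irreversible, such as a purge, is run.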
Databricks metastore assignment failures
When metastore creation succeeds but workspace assignment fails:
- Verify the workspace identifier and access path.
- Verify the deployment identity has the required Databricks account permissions.
- Document any required manual recovery only after confirming the normal deployment path cannot complete it.
Python dependency or import failures in pipelines
- Verify the dependency installation step ran in the expected directory.
- Verify the script is executed through the intended environment tooling.
- Prefer fixing the install or packaging path rather than adding ad hoc pipeline workarounds.
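A quick way to check the first two points is to ask the running interpreter where it lives and whether it can find the expected modules. The sketch below uses only the standard library; `check_imports` and `describe_environment` are hypothetical helper names, and the module list is a placeholder to replace with your own dependencies.

```python
import importlib.util
import sys


def check_imports(module_names: list) -> dict:
    """Report whether each module can be located, without importing it fully."""
    results = {}
    for name in module_names:
        try:
            results[name] = importlib.util.find_spec(name) is not None
        except ModuleNotFoundError:
            # Raised for dotted names when the parent package is missing.
            results[name] = False
    return results


def describe_environment() -> dict:
    """Show which interpreter and environment are actually executing the script."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        # In a virtual environment, sys.prefix differs from sys.base_prefix.
        "in_virtualenv": sys.prefix != sys.base_prefix,
    }


# Placeholder module names; the second is deliberately not installed.
found = check_imports(["json", "definitely_not_installed_pkg_xyz"])
environment = describe_environment()
```

If `executable` points outside the environment the pipeline was meant to use, the fix belongs in the install or invocation step, not in the script.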
Bicep validation or policy failures
- Build or validate the affected template locally.
- Recheck naming, tagging, and policy expectations before retrying deployment.
- Capture the fix back into docs when the failure exposed a missing guardrail.
Fabric operations focus
- Emphasize notebook execution context, workspace permissions, and lakehouse object visibility.
- Track which operational tasks are fully implemented versus documented placeholders.
- Keep Fabric-specific runbooks explicit until parity improves.
Troubleshooting sequence
- Confirm the failing layer: infra, orchestration, landing, Bronze, or Silver.
- Confirm whether the behavior is shared or platform-specific.
- Inspect contract and checkpoint assumptions before changing code.
- Capture the remediation steps in durable documentation or runbooks.
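The first step of this sequence can be sketched as a walk through the layers in dependency order, stopping at the first one that is not confirmed healthy. The `LAYERS` tuple mirrors the list above; `first_failing_layer` and the shape of the `status` mapping are assumptions for illustration.

```python
from typing import Optional

# Layers in the order data flows through them, as listed in the sequence above.
LAYERS = ("infra", "orchestration", "landing", "Bronze", "Silver")


def first_failing_layer(status: dict) -> Optional[str]:
    """Return the earliest layer not confirmed healthy, or None if all pass.

    status maps layer name to True (healthy) or False (failing). A layer
    missing from the mapping is treated as unconfirmed, so it is reported too.
    """
    for layer in LAYERS:
        if not status.get(layer, False):
            return layer
    return None


status = {"infra": True, "orchestration": True, "landing": False,
          "Bronze": False, "Silver": False}
layer = first_failing_layer(status)
```

Downstream layers usually fail as a consequence of an upstream one, so starting from the earliest unhealthy layer avoids chasing symptoms in Bronze or Silver.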