How does MLflow handle concurrent writes from multiple training jobs?

For database-backed stores (PostgreSQL, MySQL), concurrent writes work correctly thanks to database transaction isolation. Each run is assigned a unique run ID (a UUID), so writes from different jobs don't collide. The file-based backend store lacks proper locking and can corrupt data under concurrent writes; always use a database store for any team usage.

For artifact uploads, configure direct client-to-store uploads (S3, GCS) to avoid bottlenecking through the tracking server.
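A minimal launch command illustrating this setup, assuming a PostgreSQL database and an S3 bucket (the hostname, credentials, and bucket name are placeholders):

```shell
# Database-backed store for safe concurrent writes; artifacts go straight
# to S3 from each client rather than being proxied through this server.
mlflow server \
  --backend-store-uri postgresql://mlflow:PASSWORD@db-host:5432/mlflow \
  --default-artifact-root s3://my-mlflow-artifacts \
  --no-serve-artifacts \
  --host 0.0.0.0 --port 5000
```

With `--no-serve-artifacts`, clients need their own object-store credentials, but large uploads never pass through the tracking server.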

What is the best way to organize experiments for a large team?

Use a naming convention like team-name/project-name/model-type. MLflow experiments are flat (no hierarchy), so naming provides logical grouping. Use one experiment per model being developed, not per person. Add tags for author, branch, and purpose.
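A sketch of that convention, with a pure helper for the name and an MLflow call using it (the team, author, and tag values are illustrative):

```python
def experiment_name(team: str, project: str, model_type: str) -> str:
    """Build a flat experiment name; MLflow has no real hierarchy,
    so the slashes provide logical grouping only."""
    return f"{team}/{project}/{model_type}"


def start_tagged_run(team: str, project: str, model_type: str) -> None:
    """Set the experiment and attach the recommended run tags."""
    import mlflow  # imported here so the helper above stays dependency-free
    mlflow.set_experiment(experiment_name(team, project, model_type))
    with mlflow.start_run():
        mlflow.set_tags({"author": "jdoe", "git.branch": "main", "purpose": "baseline"})
```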

For multi-team setups, use separate MLflow instances or Databricks with workspace isolation.

How should I handle large model artifacts (multi-GB)?

Use object storage (S3, GCS, Azure Blob) for the artifact store. Configure direct client-to-store uploads to bypass the tracking server. MLflow doesn't deduplicate artifacts, so log only the best models based on a validation threshold.

For extremely large models (LLMs), consider logging a URI reference rather than uploading through MLflow.
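Both ideas can be sketched as small helpers; the `model_uri` tag name is an illustration, not an MLflow convention:

```python
def should_log_model(val_score: float, threshold: float, best_so_far: float) -> bool:
    """Gate artifact upload: log only models that clear a validation
    threshold and also beat the best run seen so far."""
    return val_score >= threshold and val_score > best_so_far


def log_reference_only(checkpoint_uri: str) -> None:
    """For multi-GB models (e.g. LLMs), record the storage URI as a tag
    instead of uploading the weights. Call inside an active run."""
    import mlflow
    mlflow.set_tag("model_uri", checkpoint_uri)
```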

How does autologging interact with custom logging?

They coexist in the same run. Autologging captures framework parameters and metrics automatically; manual calls add additional information. The only conflict is duplicate parameter names — parameters are immutable, so logging the same key with a different value raises an error.

To avoid conflicts, use autologging for framework params and manual logging for custom metrics.
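A sketch of the two coexisting in one run, with a small helper that sidesteps the immutable-parameter clash by prefixing custom keys (sklearn and the key names are illustrative):

```python
def safe_key(key: str, autologged_params: set) -> str:
    """Prefix a custom key that would collide with an immutable autologged param."""
    return f"custom.{key}" if key in autologged_params else key


def train_with_both(X, y):
    """Autologging plus manual calls in the same run."""
    import mlflow
    from sklearn.linear_model import LogisticRegression

    mlflow.autolog()  # framework params and metrics captured automatically
    with mlflow.start_run():
        model = LogisticRegression(max_iter=200).fit(X, y)
        mlflow.log_metric("business_kpi", 0.42)  # custom metric, no conflict
        # "max_iter" is autologged by sklearn, so prefix our own variant:
        mlflow.log_param(safe_key("max_iter", {"max_iter"}), 500)
    return model
```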

What happens if the tracking server goes down during training?

Log calls will fail with connection errors. Mitigations: (1) Enable async logging to buffer and retry. (2) Set a local fallback tracking URI. (3) Wrap MLflow calls in try/except. (4) Run multiple stateless server instances behind a load balancer.

The model training itself is unaffected — only the tracking metadata is at risk.
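Mitigation (3) can be sketched as a wrapper that buffers failed points locally for later replay; `log_fn` would be `mlflow.log_metric` in real use:

```python
def log_metric_safely(key, value, log_fn, buffer):
    """Attempt a tracking call; on failure, buffer the point locally
    instead of crashing the training loop. Replay `buffer` once the
    server is reachable again."""
    try:
        log_fn(key, value)
        return True
    except Exception:
        buffer.append((key, value))
        return False

# Mitigation (1), async logging, can be enabled in recent MLflow versions
# via the MLFLOW_ENABLE_ASYNC_LOGGING=true environment variable.
```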

How do I implement A/B testing with the model registry?

The registry provides version management; traffic routing happens in your serving layer. Assign “champion” and “challenger” aliases to model versions. Route a percentage of traffic to each in your API gateway. If the challenger wins, atomically reassign the “champion” alias.
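A minimal sketch of both halves: a routing function that belongs in the serving layer, and the alias reassignment via `MlflowClient.set_registered_model_alias` (the model name is a placeholder):

```python
import random


def route_request(challenger_fraction: float) -> str:
    """Decide per request which alias serves it; this logic lives in
    the API gateway, not in MLflow."""
    return "challenger" if random.random() < challenger_fraction else "champion"


def promote(model_name: str, winning_version: str) -> None:
    """Atomically repoint the 'champion' alias once the challenger wins."""
    from mlflow import MlflowClient
    client = MlflowClient()
    client.set_registered_model_alias(model_name, "champion", winning_version)
```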

How do I migrate from file-based to database-backed storage?

No built-in tool exists. Start a new server with a database backend, use the MLflow API to export and import runs, and copy the artifacts. The community mlflow-export-import tool automates this. Key lesson: start with a database-backed store for any team use.
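The export/import step can be sketched with the client API. This is a simplified illustration: a real migration (what mlflow-export-import handles) must also copy artifacts, nested runs, and the model registry.

```python
def copy_runs(src_uri: str, dst_uri: str, experiment_name: str) -> int:
    """Re-log each run's params, metrics, and tags from the old
    file-based store into the new database-backed one."""
    from mlflow import MlflowClient
    src, dst = MlflowClient(src_uri), MlflowClient(dst_uri)
    src_exp = src.get_experiment_by_name(experiment_name)
    dst_exp_id = dst.create_experiment(experiment_name)
    copied = 0
    for run in src.search_runs([src_exp.experiment_id]):
        new_run = dst.create_run(dst_exp_id, tags=run.data.tags)
        for k, v in run.data.params.items():
            dst.log_param(new_run.info.run_id, k, v)
        for k, v in run.data.metrics.items():
            dst.log_metric(new_run.info.run_id, k, v)
        dst.set_terminated(new_run.info.run_id)
        copied += 1
    return copied
```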

What are the practical scaling limits of a self-hosted server?

Runs: Millions with proper database indexing. Metrics per run: Hundreds of thousands. Concurrent users: Scale horizontally behind a load balancer; a single Gunicorn instance handles ~100 concurrent clients. Artifacts: Scales with the underlying object store. For thousands of data scientists, consider Databricks managed MLflow.
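A horizontal-scaling sketch for one such instance, assuming the same PostgreSQL backend is shared by every instance behind the load balancer (hostnames and sizing are placeholders):

```shell
# One of several identical stateless instances; --workers sets the number
# of gunicorn worker processes per instance.
mlflow server \
  --backend-store-uri postgresql://mlflow:PASSWORD@db-host:5432/mlflow \
  --workers 4 \
  --host 0.0.0.0 --port 5000
```

Because the server keeps no local state, adding capacity is just adding more instances pointed at the same database.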