An Introduction to the LAS Model Deployment Service
Troy W., Brent Y., Michael G., Aaron W.
The LAS Model Deployment Service (MDS) is a prototype system providing a cloud-provider-agnostic environment for deploying, scaling, and monitoring ML models. The MDS is intended to serve two core purposes: supporting LAS research efforts and evaluating solutions for scaling ML models in order to better inform the development of new production environments.
Background
Our original intentions and discussions around a model deployment service focused heavily on how we could simplify and standardize the process of allowing an LAS researcher to present ML-driven prototypes to an analyst. Existing solutions involving Jupyter notebooks or one-off inference servers were cumbersome and presented a high barrier to entry. The hope was that we could reduce the burden on our data scientists in demonstrating their work and receiving feedback from analysts.
It quickly became apparent that these systems are complex, and our research helped us understand those complexities and the requirements for deploying containerized machine learning models in a semi-production environment. It wasn't as simple as wrapping a model in a container and running it on a server. There would need to be adequate monitoring systems to detect model and data drift, highlight outliers, and aid in model explainability; without these it would be difficult to tell whether a deployed model was functioning properly or to diagnose issues. We would also need a consistent pipeline and set of build processes so that users could adopt the system easily and reliably.
Layers of MLOps
For the purpose of this blog post, let us separate ML infrastructure into three layers: Data Processing, Training, and Serving. Each layer can be a separate and independent system or set of systems and does not need to be owned by the same organization. The MDS belongs in the Serving layer, which focuses on repeatable deployments that scale, monitoring of both microservice and model health, and standard interfaces for retrieving ML model inferences. Due to the depth and complexity of each of these layers, we will focus primarily on the Serving layer here.
Model Servers/Multi-model serving
Embedded Model: The model runs directly inside a streaming application, which can then invoke it directly.
External Server: The model is containerized into a specialized server that exposes inference over HTTP or gRPC. Within this category there are two types of servers:
- Specialized servers that serve a specific type of model, such as TensorFlow (TF) Serving for any type of TF model.
- Generic compound servers that can serve graphs of deployed models in a variety of formats. Nvidia's Triton and Seldon's MLServer fall into this category.
Our original design reflected a bit of the methodology of its time and focused on creating self-contained model servers. This works, but it is incredibly wasteful. Instead, following the trajectory of industry leaders such as Nvidia (Triton) and Seldon, MDS is focusing on serving more models with fewer model servers. Often we can run models without needing to create a custom server: Seldon Core V2 establishes the underlying structure that lets us declare the capabilities of each deployed model server, and when we deploy a model, Seldon selects an appropriate server for the model's needs and takes care of loading it. This accomplishes three key improvements: more efficient use of existing deployed servers, models that scale separately from servers, and model repositories that only need to store the artifacts specific to running a model.
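To make this concrete, here is a minimal sketch of what registering a model with Seldon Core V2 can look like from Python (an equivalent YAML manifest applied with kubectl works just as well). The namespace, model name, and storage URI are illustrative placeholders rather than values from the MDS itself; the point is that the request describes only the artifact and its requirements, not a server.

```python
# Minimal sketch: register a model with Seldon Core V2 through the Kubernetes API.
# Assumes Core V2 is installed in the "seldon-mesh" namespace and the artifact
# already lives in object storage; the name and storageUri are illustrative.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
api = client.CustomObjectsApi()

model = {
    "apiVersion": "mlops.seldon.io/v1alpha1",
    "kind": "Model",
    "metadata": {"name": "sentiment-classifier"},
    "spec": {
        # Core V2 fetches the artifact itself, so no custom image build is needed.
        "storageUri": "s3://mds-model-artifacts/sentiment-classifier/1",
        # Capabilities the server must provide; Seldon picks an already running
        # server instance that advertises them and loads the model into it.
        "requirements": ["sklearn"],
    },
}

api.create_namespaced_custom_object(
    group="mlops.seldon.io",
    version="v1alpha1",
    namespace="seldon-mesh",
    plural="models",
    body=model,
)
```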
We found the external generic compound server strategy to be the most widely applicable for our use cases at LAS. Seldon Core V2 is one of the more mature open source offerings available and was chosen as the core of the MDS. We liked many of its standard features, including:
- Model framework agnostic
- Stock metrics for monitoring system and model health
- Kubernetes native
- Batch and streaming processing through Apache Kafka
Pipelines/Workflows
Using Seldon's concept of pipelines allows MDS to provide several features essential for serving models in production:
- Batch/stream processing
- Outlier and drift detection
- Chaining, joining and conditional logic
These capabilities are facilitated through integration with Apache Kafka: Seldon manages the creation and deletion of Kafka topics and routes data through them. This not only provides an asynchronous structure for independently scaling individual pieces of a given pipeline, but also allows pipelines to be added or reconfigured without interrupting models currently serving production needs.
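To illustrate the pipeline concept, the sketch below defines a Core V2 pipeline in which a preprocessing model feeds a classifier while a drift detector consumes the same intermediate data. The step names are hypothetical and assume those models are already deployed; Seldon materializes each edge between steps as a Kafka topic that it creates and manages.

```python
# Sketch of a Seldon Core V2 Pipeline: a preprocessing step feeding a classifier,
# with a drift detector reading the same preprocessed data. Step names are
# placeholders for models assumed to be deployed already.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

pipeline = {
    "apiVersion": "mlops.seldon.io/v1alpha1",
    "kind": "Pipeline",
    "metadata": {"name": "sentiment-pipeline"},
    "spec": {
        "steps": [
            {"name": "tokenizer"},
            {"name": "sentiment-classifier", "inputs": ["tokenizer"]},
            {"name": "drift-detector", "inputs": ["tokenizer"]},
        ],
        # Only the classifier's output is returned to the caller; the drift
        # detector consumes the stream asynchronously via its Kafka topic.
        "output": {"steps": ["sentiment-classifier"]},
    },
}

api.create_namespaced_custom_object(
    group="mlops.seldon.io",
    version="v1alpha1",
    namespace="seldon-mesh",
    plural="pipelines",
    body=pipeline,
)
```

Because the pipeline is a separate resource, it can be changed or removed without redeploying the underlying models.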
Monitoring
We need to know when a model's performance degrades and how to explain that degradation. Concept-drift monitoring checks that the model still behaves correctly, while model explainability lets us understand why a model reached a particular decision, building trust in the model's behavior.
Seldon Core offers a number of features that make monitoring and metrics collection straightforward:
- Alibi Explain integration for aiding in model explainability with an explainer pipeline.
- Alibi Detect can be used for outlier, drift, and adversarial detection within a monitoring pipeline (a small drift-detection sketch follows this list).
- Prometheus and Grafana integration provides advanced and customizable metrics monitoring.
- Full auditability through model input-output request logging integration with Elasticsearch.
- Distributed tracing through integration with Jaeger for insight into latency across microservice hops.
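To give a feel for the drift-detection piece, here is a small standalone sketch using Alibi Detect with synthetic data. In MDS this logic would run as a step in a monitoring pipeline rather than as a script, and the feature dimensions and threshold are illustrative.

```python
# Standalone sketch of drift detection with Alibi Detect on synthetic data.
import numpy as np
from alibi_detect.cd import KSDrift

# Reference data the model was trained or validated on.
x_ref = np.random.normal(loc=0.0, scale=1.0, size=(1000, 16)).astype(np.float32)

# Kolmogorov-Smirnov test per feature, with multiple-testing correction built in.
detector = KSDrift(x_ref, p_val=0.05)

# A batch of recent production inputs whose distribution has shifted.
x_prod = np.random.normal(loc=0.5, scale=1.0, size=(200, 16)).astype(np.float32)

result = detector.predict(x_prod)
print(result["data"]["is_drift"])  # 1 if drift is detected, 0 otherwise
print(result["data"]["p_val"])     # per-feature p-values
```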
LAS Model Deployment Service, where are we now?
Last year we presented an introduction to machine learning operations (MLOps) along with the drive and concepts fueling LAS's creation of, and research into, a model deployment service (MDS). Building on that foundation, we have taken our initial single-machine prototype and expanded it into a fully scalable Kubernetes cluster in Amazon's Elastic Kubernetes Service (EKS). This post covers the changes we have made over the last year.
The MLOps landscape has changed dramatically in the past year with the rise in popularity of large language models (LLMs). Model pre-deployment optimization can only go so far, and industry leaders such as Seldon and Nvidia have made advances in serving that benefit not only LLMs but also traditional ML models. The MDS's design has been updated to take advantage of these improvements, with a focus on serving traditional ML models and workflows. While we have considered the changes needed to support LLMs, our design does not fully accommodate a widely scalable LLM at this time; however, we anticipate researching this in the upcoming year.
Original Design and Methodology
MDS's original approach to model serving focused on creating a model container with a standardized interface, which could be deployed in Kubernetes using Seldon Core. Monitoring was facilitated through a combination of Prometheus metrics and Knative eventing triggers. This initial approach was a good starting point but presented some challenges.
Issues/Challenges
User Experience
- Building a container for each model puts a heavy burden on users and infrastructure.
- The standardized interface was difficult to implement for more complex input, such as multi-type vector arrays.
System Requirements
- Storing a separate combined image of model and model server unnecessarily increases system storage requirements, since the server portion is largely redundant; model servers are often similar or identical.
- Models and model servers are deployed in a one-to-one ratio: deploying or scaling a single model requires the system to also add a model server. Each model server consumes a base set of resources to serve a model; while these might be small, they add up in cost as the system scales.
System Scalability
- Model workflows/pipeline deployments are strongly coupled with model deployment. Significant changes to a model or a pipeline could require the whole pipeline to be redeployed.
- The system could use Apache Kafka to facilitate batch and stream processing, but the integration was complex and difficult.
Updated Design and Methodology
MDS is still focused on serving models from servers running in Kubernetes; however, with the upgrade to Seldon Core V2 it has pivoted to a multi-model, data-centric serving approach that addresses the above concerns. This approach conceptually separates the deployment of the model, the model server, and the model pipeline.
Advantages
User Experience
- Once a server exists that is capable of running a model, users only need to focus on the code and configuration specific to running inference on their model.
- Seldon Core V2 uses the Open Inference Protocol (formerly known as the V2 Inference Protocol). Not only does this specification standardize the expected service endpoints, it also defines how to structure complex data in requests (a request sketch follows this list). Notably, this is the same specification used by Nvidia's Triton, which allows users already utilizing Triton to run seamlessly on Core V2.
System Requirements
- Model artifacts can be stored and versioned separately from the servers responsible for serving them, greatly reducing storage requirements.
- Multi-model serving utilizes a smaller set of servers, reducing infrastructure compute and memory costs.
System Scalability
- Pipelines/workflows are deployed separately from models and model servers. This allows pipelines to be changed, updated and scaled dynamically with far less impact on production.
- Native and straightforward integration with Apache Kafka. Seldon Core V2 uses Kafka as a core component of its system architecture, so neither users nor engineers need to know how to connect to, create, update, delete, or publish to Kafka topics; this is all handled by Core V2.
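For reference, a request under the Open Inference (V2) protocol looks roughly like the sketch below. The host and model name are placeholders; the endpoint path and the name/shape/datatype/data payload structure come from the protocol specification.

```python
# Sketch of an Open Inference (V2) protocol call. Host, model name, and tensor
# contents are placeholders; the endpoint path and payload layout follow the spec.
import requests

host = "http://mds-inference.example.internal"  # placeholder ingress address
model = "sentiment-classifier"                  # placeholder model name

payload = {
    "inputs": [
        {
            "name": "input-0",
            "shape": [2, 4],
            "datatype": "FP32",
            # Row-major values for a 2x4 tensor.
            "data": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
        }
    ]
}

resp = requests.post(
    f"{host}/v2/models/{model}/infer",
    json=payload,
    # When routing through Seldon Core V2's Envoy gateway, this header selects
    # the target model; it is not needed when calling a server directly.
    headers={"Seldon-Model": model},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["outputs"])  # responses mirror the same name/shape/datatype/data layout
```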
Great, we have a deployment system! How do we use it?
Since we've chosen to focus on the external server method of model deployment, let's look at some of the ways it can enable analysts, data scientists, and existing applications to integrate ML into their workflows.
Analysts:
- Facilitation of human-machine teaming to support decision making.
Data Scientists:
- Easier, more consistent means of getting analyst feedback.
- Quicker testing of resource-intensive models by running them on scalable infrastructure.
Analytic/Platform Developers:
- Consistent interfaces and independent scaling ensure that existing and new tools can take advantage of ML in their decision-making processes without needing to embed ML models directly in their codebases.
Final thoughts
There are many ways to build ML infrastructure and deploy ML models; the systems covered in this blog are only one of them. These methods will not cover every need, but we believe they cover a broad range of uses and solutions.
Infrastructure will be essential to drive further improvements and deeper integration of ML models into users' decision-making processes. We hope that our experiments and lessons learned will provide others with the knowledge needed to create similar systems at a corporate level.
Important Terms:
Technology List:
Monitoring:
- Grafana – https://grafana.com/
- Prometheus – https://prometheus.io/
- OpenTelemetry – https://opentelemetry.io/ – monitors events and workflows in the system; used with Jaeger to understand where processing bottlenecks occur.
- Jaeger Tracing – https://docs.seldon.io/projects/seldon-core/en/latest/graph/distributed-tracing.html
- Alibi Detect – https://github.com/SeldonIO/alibi-detect
- Alibi Explain – https://github.com/SeldonIO/alibi
Model Repository/Model Deployment:
- Seldon Core V2 – https://docs.seldon.io/projects/seldon-core/en/v2/contents/getting-started/
- Envoy – https://www.envoyproxy.io/ – Seldon uses Envoy to route traffic internally and externally; it can be used behind other ingress options such as NGINX.
- Bailo – https://github.com/gchq/Bailo
- Seldon MLServer – https://mlserver.readthedocs.io/en/latest/
- Triton – https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver
Autoscaling:
- KEDA – https://keda.sh/docs/2.10/
Streaming/Scaling Support:
- Kafka – https://docs.seldon.io/projects/seldon-core/en/latest/streaming/kafka.html
- Argo – https://argoproj.github.io/argo-workflows/
API Specs:
- V2 inference spec – https://kserve.github.io/website/0.8/modelserving/inference_api/
- Seldon prediction – https://docs.seldon.io/projects/seldon-core/en/latest/reference/apis/external-prediction.html
Useful Development Tools/Python packages:
- MLServer – also a Python package usable as a command-line tool; when imported into Python it can assist in formatting calls to the V2 inference spec.
- Triton Client – useful for making calls to anything serving the V2 spec API (see the sketch at the end of this list).
- Ansible – Seldon provides Ansible playbooks for setting up a Core V2 development environment.
- Kind – tool for creating local mini Kubernetes clusters that run in a Docker container.
- Rclone – Tool used for copying and syncing data between remote or local sources. Useful for copying model weights out of remote storage into local volumes when deploying models.
- Cookiecutter – Tool for creating projects from a defined project template.
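As a quick illustration of the Triton client entry above, the sketch below sends a V2-spec inference request from Python. The server address, model name, and tensor names are placeholders and will differ for any real deployment.

```python
# Sketch: calling any V2-spec server (Triton, MLServer, Seldon Core V2) with the
# Triton HTTP client. Address, model name, and tensor names are placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:9000")

inputs = [httpclient.InferInput("input-0", [2, 4], "FP32")]
inputs[0].set_data_from_numpy(np.random.rand(2, 4).astype(np.float32))

outputs = [httpclient.InferRequestedOutput("output-0")]

result = client.infer(model_name="sentiment-classifier", inputs=inputs, outputs=outputs)
print(result.as_numpy("output-0"))  # the response tensor as a NumPy array
```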