Monday Ledger: The 5 Skills Your IT Team Needs to Support AI Infrastructure

This is the Monday Ledger, the delivery follow-on to Saturday's strategy post. Saturday's question was whether your cloud team can actually support AI infrastructure. Today's question is the practical one, turning that strategy into steps: which skills are missing, and how do you find out?

The gap nobody is measuring

On Saturday, we saw that 54% of organisations have postponed or even cancelled AI initiatives because the infrastructure was too complex to manage. This data comes from the 2026 State of AI Infrastructure Report by DDN, which highlights that infrastructure complexity is the primary drag on AI return on investment.

That complexity rarely comes from the technology itself. The issue is that teams are expected to support infrastructure they were never trained for. AI is moving so fast that it is hard for them to keep up, particularly if they started on the back foot in the first place.

Your IT support team knows cloud. They can manage uptime, ticket server alerts, patch virtual machines, and troubleshoot network latency at 2am. Those skills took years to build, and they matter. They do not translate directly to AI infrastructure, although they can transfer indirectly.

The gap is not about intelligence or effort. It is about a different set of problems that require a different set of skills, in a situation where technology is moving faster than ever. Most organisations have not yet defined what those skills are, let alone measured whether their teams have them. AI readiness assessments tend to focus on technology, but skills and training are also a silent bottleneck in the delivery pipeline.

5 skills cloud experience does not give you

The transition from cloud to AI infrastructure is not a linear upgrade. It is a rethink of the fundamental unit of work and the definition of "health" for a system.

1. Model monitoring: not just infrastructure monitoring

Cloud monitoring means: is the server up? Is CPU at capacity? Is the database responding?

AI infrastructure monitoring means something different. A model can be running perfectly, with all green lights on the infrastructure dashboard, and still be producing completely wrong answers. Model drift happens when outputs degrade over time because the data changes – but the model does not. Shadow errors happen when a model returns a confident-sounding result that is factually incorrect. Ultimately, both lead to bad decisions based on faulty knowledge founded on shifting data 'sand'.

Your team needs to know the difference between an infrastructure failure and a model failure. They need to know how to detect output degradation, not just service downtime. These are separate disciplines. Without data fluency, an IT team cannot interpret the subtle signals that indicate a model is no longer fit for its purpose.
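One way to make output degradation visible is to compare what the model says now with what it said when it was signed off. The sketch below is a minimal illustration, not a production monitor: it assumes a model that emits category labels, uses the Population Stability Index as one possible drift measure, and the labels, sample sizes, and threshold are all invented for the example.

```python
import math
from collections import Counter

def label_shares(labels, categories, floor=1e-6):
    """Return each category's share of the labels, floored to avoid log(0).

    Assumes `labels` is non-empty.
    """
    counts = Counter(labels)
    total = len(labels)
    return {c: max(counts.get(c, 0) / total, floor) for c in categories}

def population_stability_index(baseline, recent):
    """Compare two samples of model output labels; higher means more shift."""
    categories = set(baseline) | set(recent)
    base = label_shares(baseline, categories)
    new = label_shares(recent, categories)
    return sum((new[c] - base[c]) * math.log(new[c] / base[c]) for c in categories)

# Hypothetical output labels from a support-ticket classifier.
baseline = ["billing"] * 50 + ["technical"] * 40 + ["other"] * 10
recent = ["billing"] * 20 + ["technical"] * 35 + ["other"] * 45

DRIFT_THRESHOLD = 0.2  # assumed alert level for this sketch; tune per model

psi = population_stability_index(baseline, recent)
drifted = psi > DRIFT_THRESHOLD
print(f"PSI = {psi:.3f}, drift alert: {drifted}")
```

Every infrastructure metric could be healthy while this check fires, which is the point: it watches the outputs, not the server.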

The gap: Cloud teams know how to monitor systems. AI teams also need to monitor outputs.

2. Token and cost management

Cloud cost management works in units most teams understand: compute hours, storage volumes, data transfer rates. The numbers are large but the logic is familiar. IT operations teams are accustomed to managing static or predictable resource allocations, but AI systems rarely offer that predictability.

AI cost management works on a different unit: the token. Every prompt sent to a model, every response returned, every document embedded into a vector store: all of these consume tokens, and tokens cost money at a rate that scales fast with usage.

If an IT team cannot read a token usage dashboard, then it cannot manage AI spend. If an IT team does not understand context window limits, then it cannot diagnose why a prompt-heavy application is burning budget unexpectedly. This is not a finance team's problem. It sits squarely in IT operations, and most cloud-trained engineers have never had to think in tokens. Efficient data platform engineering requires a deep understanding of these micro-costs as well as optimal model choices.
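To see how quickly token spend moves, it helps to do the arithmetic once. The sketch below is a rough estimate under stated assumptions: the per-million-token prices, token counts, and request volume are made up for illustration and are not any vendor's actual rates.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelPricing:
    """Hypothetical prices in USD per million tokens; substitute your vendor's rates."""
    input_per_million: float
    output_per_million: float

def request_cost(pricing, input_tokens, output_tokens):
    """Cost of a single model call in USD."""
    return (input_tokens * pricing.input_per_million
            + output_tokens * pricing.output_per_million) / 1_000_000

def monthly_cost(pricing, input_tokens, output_tokens, requests_per_day, days=30):
    """Cost of a steady workload over a month in USD."""
    return request_cost(pricing, input_tokens, output_tokens) * requests_per_day * days

pricing = ModelPricing(input_per_million=3.00, output_per_million=15.00)

# Same application, two prompt designs: a lean prompt versus one that
# pastes a long document into the context window on every request.
lean = monthly_cost(pricing, input_tokens=500, output_tokens=300, requests_per_day=10_000)
heavy = monthly_cost(pricing, input_tokens=20_000, output_tokens=300, requests_per_day=10_000)

print(f"lean prompt:  ${lean:,.2f} per month")
print(f"heavy prompt: ${heavy:,.2f} per month")
```

With these invented rates, the lean prompt comes to about $1,800 a month and the heavy one to about $19,350, at exactly the same request volume. Nothing changed on the infrastructure dashboard; only the prompt did.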

The gap: Cloud teams manage compute and storage costs. AI teams also need to manage token budgets and prompt efficiency.

3. Prompt governance and versioning

In cloud infrastructure, configuration changes are tracked. Infrastructure as Code, version control, change logs – this is all the good stuff that your team already knows. Untracked changes cause incidents.

Prompts are configuration. A prompt is an instruction to a model, and changing it changes the model's behaviour. Yet, most organisations treat prompts like Post-it notes: informal, undocumented, changed by anyone with access, and impossible to audit when something goes wrong.

Your IT team needs to treat prompt management the same way they treat code or configuration management: version-controlled, reviewed, tested before deployment, and rolled back when it causes problems. Right now, very few teams have this process. That absence is a governance risk, at best. At worst, it is an incident waiting to happen (or perhaps one that has already happened, but has not been caught yet). Comprehensive AI governance training is the only way to mitigate this risk.
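What "version-controlled, reviewed, and rolled back" could look like for prompts is sketched below. This is a hypothetical in-memory registry written for illustration; the class names, the reviewer rule, and the example prompts are assumptions, and a real team would more likely keep prompts in Git or a database with the same ideas applied.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass(frozen=True)
class PromptVersion:
    version: int
    text: str
    author: str
    approved_by: str
    checksum: str  # SHA-256 of the text, so silent edits are detectable

@dataclass
class PromptRegistry:
    """Hypothetical registry: keeps every version and tracks which one is live."""
    history: dict = field(default_factory=dict)  # name -> list of PromptVersion
    active: dict = field(default_factory=dict)   # name -> active version number

    def publish(self, name, text, author, approved_by):
        """Record a reviewed prompt change and make it the live version."""
        if author == approved_by:
            raise ValueError("a prompt change needs a reviewer other than its author")
        versions = self.history.setdefault(name, [])
        checksum = hashlib.sha256(text.encode("utf-8")).hexdigest()
        entry = PromptVersion(len(versions) + 1, text, author, approved_by, checksum)
        versions.append(entry)
        self.active[name] = entry.version
        return entry

    def rollback(self, name):
        """Reactivate the previous version; the full history is kept for audit."""
        if self.active[name] <= 1:
            raise ValueError("no earlier version to roll back to")
        self.active[name] -= 1
        return self.get(name)

    def get(self, name):
        """Return the currently live version of a prompt."""
        return self.history[name][self.active[name] - 1]

registry = PromptRegistry()
registry.publish("triage", "Classify the ticket as billing or technical.", "asha", "ben")
registry.publish("triage", "Classify the ticket. Be brief.", "asha", "ben")
restored = registry.rollback("triage")
print(restored.version, restored.checksum[:12])
```

The design choice that matters is that rollback never deletes anything: the faulty version stays in the history, so there is still something to audit after the incident.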

The gap: Cloud teams version infrastructure configuration. AI teams also need to version and govern prompts.

4. Data pipeline quality for AI: not just data movement

Cloud-trained data engineers move data. They build pipelines that extract, transform, and load data reliably. They care about whether the data arrives. They often do not need to care deeply about whether the data is good. For example, ensuring a managed identity for Azure SQL is correctly configured is a standard security task, but it does not guarantee the data within the database is suitable for a Large Language Model.

AI infrastructure is different. A model trained or grounded on poor-quality data will produce poor-quality outputs – confidently. The 8 C's of data quality are not abstract governance concepts. They are the checklist your team needs to run before data reaches an AI model:

Certainty: How accurate is the source?
Coverage: Does the data represent the entire problem space?
Completeness: Are there missing values that skew results?
Consistency: Is the data formatted uniformly across sources?
Currency: Is the data up to date?
Commonality: Are definitions shared across the business?
Chance: Is there random noise affecting the signal?
Consent: Do we have the legal right to use this data for AI?

If your IT team is responsible for the data pipelines feeding your AI systems, they need to understand data quality as a functional requirement, not just a data governance team's concern.
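Some of the checks in the list above can be automated as a gate in the pipeline. The sketch below covers only four of the eight (Completeness, Consistency, Currency, and Consent) and is a minimal illustration: the field names, the country reference list, and the one-year freshness limit are all assumptions made for the example.

```python
from datetime import date, timedelta

REQUIRED_FIELDS = ("customer_id", "country", "updated_on", "consent_for_ai")
MAX_AGE = timedelta(days=365)           # assumed freshness limit
ALLOWED_COUNTRIES = {"GB", "IE", "FR"}  # assumed reference list

def quality_issues(record, today):
    """Return the list of checks a record fails; an empty list means it passes."""
    issues = []
    missing = [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]
    if missing:
        issues.append("completeness: missing " + ", ".join(missing))
    country = record.get("country")
    if country and country not in ALLOWED_COUNTRIES:
        issues.append(f"consistency: unexpected country code {country!r}")
    updated = record.get("updated_on")
    if updated and today - updated > MAX_AGE:
        issues.append("currency: record older than freshness limit")
    if record.get("consent_for_ai") is False:
        issues.append("consent: no permission to use for AI")
    return issues

today = date(2026, 3, 2)  # fixed date so the example is repeatable
records = [
    {"customer_id": 1, "country": "GB", "updated_on": date(2026, 1, 10), "consent_for_ai": True},
    {"customer_id": 2, "country": "United Kingdom", "updated_on": date(2023, 5, 1), "consent_for_ai": True},
    {"customer_id": 3, "country": "FR", "updated_on": None, "consent_for_ai": False},
]

report = {r["customer_id"]: quality_issues(r, today) for r in records}
fit_for_model = [r for r in records if not report[r["customer_id"]]]

for customer_id, issues in report.items():
    print(customer_id, issues or "ok")
```

All three records would arrive successfully through a conventional pipeline. Only the first is fit to put in front of a model.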

The gap: Cloud teams ensure data arrives. AI teams also need to ensure data is fit for model consumption.

5. AI incident response: who owns what when things go wrong

In cloud infrastructure, incident response is well-defined. There is a runbook. There is an escalation path. There is a clear line between a network problem, an application problem, and a database problem.

AI incidents are murkier. When an AI system produces a harmful output, who is responsible: the team that deployed the model, the team that wrote the prompt, the team that owns the data, or the vendor? When a model starts giving inconsistent answers, is that an infrastructure problem, a data problem, or a model problem?

Most IT teams do not have an AI incident response runbook. They do not have a RACI (Responsible, Accountable, Consulted, Informed) matrix that defines ownership across model, data, prompt, and output layers. When something goes wrong, and it will, they are improvising.

Improvised incident response at 2am is expensive. Defining the runbook before the incident is cheap. IT team upskilling for AI must include these operational protocols.
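A RACI matrix does not need special tooling; it can start as a small, checkable data structure. The sketch below is one possible shape, assuming the four layers named above. Every team name in it is hypothetical and stands in for whatever your organisation actually calls these groups.

```python
from enum import Enum

class Layer(Enum):
    MODEL = "model"
    DATA = "data"
    PROMPT = "prompt"
    OUTPUT = "output"

# Hypothetical team names, for illustration only.
RACI = {
    Layer.MODEL: {"responsible": "ml-engineering", "accountable": "head-of-ai",
                  "consulted": ["vendor-support"], "informed": ["service-desk"]},
    Layer.DATA: {"responsible": "data-engineering", "accountable": "chief-data-officer",
                 "consulted": ["data-governance"], "informed": ["service-desk"]},
    Layer.PROMPT: {"responsible": "ai-platform", "accountable": "head-of-ai",
                   "consulted": ["product-owner"], "informed": ["service-desk"]},
    Layer.OUTPUT: {"responsible": "product-owner", "accountable": "business-sponsor",
                   "consulted": ["legal", "ai-platform"], "informed": ["service-desk"]},
}

def unowned_layers(raci):
    """Layers with no responsible team or no accountable owner."""
    return [layer for layer in Layer
            if not raci.get(layer, {}).get("responsible")
            or not raci.get(layer, {}).get("accountable")]

def route_incident(layer, raci=RACI):
    """Turn a diagnosed layer into who to page, escalate to, and notify."""
    entry = raci[layer]
    return {"page": entry["responsible"],
            "escalate_to": entry["accountable"],
            "notify": entry["consulted"] + entry["informed"]}

print("unowned layers:", unowned_layers(RACI))
plan = route_incident(Layer.PROMPT)
print(plan)
```

Running `unowned_layers` against a draft matrix is a cheap way to find the ownership gaps before an incident finds them for you.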

The gap: Cloud teams have established incident response processes. AI teams need equivalent processes built specifically for model, data, and output failures.

⬇ [Download the AI Infrastructure Skills Gap Checklist (PDF) →]

Or, if you'd rather talk through your team's specific situation: book a Surgery Hour and we can work through it together.

Jennifer Stirrup is an independent AI and data strategy consultant. She works with organisations that need AI to actually work, not just in the pilot but in production. jenstirrup.com
