AI, Training, Inference and RAG – A New Dawn for AI

By Patrick Smith, Field CTO EMEA at Pure Storage

The 30th of November 2022 was a monumental day. That was the day ChatGPT was released to the world by OpenAI, and the rest is history. In the two years since, we've seen a meteoric rise in interest in AI. This has driven an almost 10x increase in the market capitalisation of NVIDIA, the leading maker of GPUs, along with bold predictions about how much businesses will invest in AI and the impact it will have on society.

This feels very different to the previous AI dawns we've seen over the last 70 years, from the Turing Test and defeats of chess grandmasters to autonomous driving and now the Generative AI explosion. The game has well and truly changed, but it is still built on certain fundamental concepts. For many years, AI advancements have rested on three key developments: 1) more powerful compute resources, in the form of GPUs; 2) improved algorithms or models, in the case of Generative AI the Transformer architecture and large language models (LLMs); and 3) access to massive amounts of data. At a very high level, the phases of an AI project are data collection and preparation, model development and training, and model deployment, also known as inference.

It’s all about the data

Data collection and preparation cannot be overlooked: good-quality, relevant, and unbiased data is key to a successful AI project. Organizations are often challenged in understanding their data, identifying data ownership, and breaking down silos so that data can be used effectively. Without access to high-quality data, an initiative is unlikely to succeed. Increasingly, organizations use multimodal data in their AI projects: not just text, but also audio, images, and even video. The amounts of data involved, and therefore the underlying storage requirements, are significant.

Training the model

The training phase is typically approached in one of two ways. The first is foundational model training, which involves leveraging a huge amount of data to build an AI model from the ground up and iteratively training it to produce a general-purpose model. This is typically carried out by large technology companies with significant resources; Meta has recently talked about training its open source Llama 3.1 405-billion-parameter model on over 15 trillion tokens, which is reported to have taken around 40 million GPU hours on 16,000 GPUs. This long training time highlights a key requirement when training large models: frequent checkpointing to allow recovery from failures. With large models, it is essential that the storage used for checkpointing offers very high write performance and capacity.
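
To make the checkpointing requirement concrete, here is a minimal sketch of periodic checkpoint saving and recovery in PyTorch. The tiny model, optimiser, and interval are illustrative placeholders; real large-model training would shard this state across many GPUs and write it to shared, high-performance storage.

```python
# Minimal sketch of periodic checkpointing during training (PyTorch).
# Model, optimiser, and interval are placeholders for illustration only.
import torch
import torch.nn as nn

model = nn.Linear(512, 512)                        # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
CHECKPOINT_EVERY = 1_000                           # steps between checkpoints

def save_checkpoint(step: int, path: str) -> None:
    # Persist everything needed to resume: weights, optimiser state, position.
    torch.save(
        {
            "step": step,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )

def load_checkpoint(path: str) -> int:
    # Restore state after a failure and return the step to resume from.
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["step"]

for step in range(3_000):
    # ... forward pass, loss, backward pass, optimizer.step() would go here ...
    if step % CHECKPOINT_EVERY == 0:
        save_checkpoint(step, f"checkpoint_{step}.pt")
```

The faster each checkpoint can be written, the less GPU time is lost waiting on storage, which is why write performance matters so much at this stage.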

The second training approach is model fine-tuning. This involves taking an existing model, where another organization has done the heavy lifting, and applying domain-specific data to that model through further training. In this way, an organization benefits from its own personalized model but doesn't need to train it from scratch.
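
As an illustration of the fine-tuning route, the following is a minimal sketch using the Hugging Face transformers and datasets libraries; the small base model and the domain corpus file are hypothetical placeholders standing in for a real pretrained model and an organization's own data.

```python
# Minimal fine-tuning sketch (Hugging Face transformers/datasets).
# "distilgpt2" and "domain_corpus.txt" are illustrative placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "distilgpt2"                      # placeholder small base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Domain-specific corpus, one example per line (hypothetical file name).
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()                     # further training on top of pretrained weights
trainer.save_model("finetuned")     # save the personalized model for deployment
tokenizer.save_pretrained("finetuned")
```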

Whatever the approach, training requires massively parallel processing with GPUs, necessitating high throughput and access speeds to handle large datasets efficiently. Data storage for AI training must therefore provide very high performance, not least to keep GPUs fed with data; scalability to manage large training datasets; and reliability, given the importance and cost of training models.

Into production

Once a model has been trained and its performance meets requirements, it is put into production. This is when the model uses data it hasn't seen before to draw conclusions or provide insights. This is known as inference, and it is when value is derived from an AI initiative. The resource usage and cost associated with inferencing can dwarf those of training, because inferencing places constant demands on compute and storage, potentially at massive scale; think of millions of users accessing a customer service chatbot.
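
As a simple illustration, a minimal inference sketch might look like the following, assuming the Hugging Face transformers library and reusing the hypothetical "finetuned" model directory from the training sketch above.

```python
# Minimal inference sketch: serving a previously fine-tuned model.
# "finetuned" is the hypothetical output directory from the earlier sketch.
from transformers import pipeline

generator = pipeline("text-generation", model="finetuned")

# Unseen input arrives from users (e.g. a customer service chatbot request)
# and the deployed model produces a response on demand.
reply = generator("Customer: my order arrived damaged, what should I do?",
                  max_new_tokens=64)
print(reply[0]["generated_text"])
```

Every such request consumes compute, and the inputs and outputs are typically written to storage for record-keeping and future retraining, which is where the constant, large-scale demand comes from.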

The underlying storage for inferencing must deliver high performance, which is key to providing timely results, as well as easy scaling to meet the storage requirements of the data being fed into the model, both for record-keeping and to provide retraining data. The quality of the results from inferencing is directly related to the quality of the trained model and the training dataset. Generative AI adds a twist on accuracy: by its nature, inaccuracies, known as hallucinations, are highly likely, and they have caused problems that frequently hit the headlines.

Improving accuracy

Users of ChatGPT will realize the importance of the query fed into the model. A well-structured, comprehensive query can result in a much more accurate response than a curt question. This has led to the concept of "prompt engineering", where a detailed, well-crafted prompt, often including relevant context and instructions, is provided to the model to yield the optimal output.
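
As a rough illustration, the difference between a curt question and an engineered prompt can be seen in a sketch like the one below, which uses the OpenAI Python client; the model name and the prompt contents are placeholders, and any chat-completion API would behave similarly.

```python
# Illustrative sketch: curt question versus engineered prompt.
# The model name and the example scenario are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

curt_prompt = "Fix our churn."

structured_prompt = (
    "You are a customer-retention analyst for a telecoms provider.\n"
    "Context: churn has risen over two quarters, concentrated in prepaid\n"
    "customers on legacy plans.\n"
    "Task: propose three retention actions, each with the data needed to\n"
    "measure its impact, in under 200 words."
)

for prompt in (curt_prompt, structured_prompt):
    response = client.chat.completions.create(
        model="gpt-4o-mini",                      # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)
```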

An alternative approach that is becoming increasingly important is retrieval augmented generation, or RAG. RAG augments the query with an organization’s own data in the form of use-case specific context coming directly from a vector database such as Chroma or Milvus. Compared to prompt engineering, RAG produces improved results and significantly reduces the possibility of hallucinations. Equally important is the fact that current, timely data can be used with the model rather than being limited to a historic cut-off date. 

RAG depends on vectorizing an organization's data so it can be integrated into the overall architecture. Vector databases often see significant growth in dataset size compared to the source, as much as 10x, and are very performance-sensitive, given that the user experience is directly related to the response time of the vector database query. As such, the performance and scalability of the underlying storage play an important part in the successful implementation of RAG.
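
A minimal RAG sketch, using Chroma as the vector database as mentioned above, might look like the following. The documents, question, and prompt template are illustrative; a production pipeline would chunk, embed, and index far larger corpora before sending the augmented prompt to the generative model.

```python
# Minimal RAG sketch using Chroma as the vector database.
# Documents, question, and prompt template are illustrative placeholders.
import chromadb

client = chromadb.Client()
collection = client.create_collection(name="company_docs")

# Vectorise a few internal documents (Chroma applies its default embedder).
collection.add(
    ids=["doc1", "doc2"],
    documents=[
        "Our 2024 warranty covers hardware faults for 36 months.",
        "Support tickets are answered within one business day.",
    ],
)

question = "How long is the hardware warranty?"

# Retrieve the most relevant chunks for the user's question.
results = collection.query(query_texts=[question], n_results=2)
context = "\n".join(results["documents"][0])

# Augment the prompt with retrieved, current company data before calling an LLM.
augmented_prompt = (
    f"Answer using only the context below.\n\nContext:\n{context}\n\n"
    f"Question: {question}"
)
print(augmented_prompt)   # this string would be sent to the generative model
```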

The AI energy conundrum 

The past few years have seen electricity costs soar across the globe, with no signs of slowing down. In addition, the rise of Generative AI means that the energy needs of data centres have increased many times over. In fact, the IEA estimates that AI, data centres, and cryptocurrency together accounted for almost 2% of global energy demand in 2022, and that this demand could double by 2026. This is partly due to the high power demands of GPUs, which can require 40-50 kilowatts per rack, well beyond the capability of many data centres.

Driving efficiency throughout the data centre is therefore essential, meaning infrastructure like all-flash data storage is crucial for managing power and space, as every watt saved on storage can help power more GPUs. With some all-flash storage technologies, it is possible to achieve up to an 85% reduction in energy usage and up to 95% less rack space than competing offerings, providing significant value as a key part of the AI ecosystem.

Data storage: part of the AI puzzle

AI's potential is almost unimaginable. However, for AI models to deliver, a careful approach is needed across training, whether foundational or fine-tuning, to result in accurate and scalable inference. RAG can then be leveraged to improve output quality even further.

It's clear that data is a key component at every stage. Flash storage is essential to delivering AI's transformative impact on business and society, offering unmatched performance, scalability, and reliability. Flash supports AI's need for real-time access to unstructured data, facilitating both training and inference, while reducing energy consumption and carbon emissions, making it vital for efficient, sustainable AI infrastructure.
