Easy and Secure LLM Inference and Retrieval Augmented Generation (RAG) Using Snowflake Cortex

Now that large language models (LLMs) make human-machine interaction in natural language possible, more data teams and developers can bring AI to their daily workflows. To do this efficiently and securely, teams must decide how to combine the knowledge of pre-trained LLMs with their organization’s private enterprise data. Doing so helps address hallucinations (that is, incorrect responses), which LLMs can generate because they have only been trained on data available up to a certain date.

To reduce these AI hallucinations, LLMs can be combined with private data sets via processes that either don’t require LLM customization (such as prompt engineering or retrieval augmented generation) or that do require customization (like fine-tuning or retraining). To decide where to start, teams need to weigh the resources and time it takes to customize AI models against the timelines required to show ROI on generative AI investments.

While every organization should keep both options on the table, the key to delivering value quickly is to identify and deploy use cases that work with prompt engineering and retrieval augmented generation (RAG), as these can be fast and cost-effective ways to get value from enterprise data with LLMs.

To empower organizations to deliver fast wins with generative AI while keeping data secure when using LLMs, we are excited to announce that Snowflake Cortex LLM functions are now available in public preview for select AWS and Azure regions. With Snowflake Cortex, a fully managed service that runs on NVIDIA GPU-accelerated compute, there is no need to set up integrations, manage infrastructure or move data outside of the Snowflake governance boundary to use the power of industry-leading LLMs from Mistral AI, Meta and more.

So how does Snowflake Cortex make AI easy, whether you are doing prompt engineering or RAG? Let’s dive into the details and check out some code along the way.

To prompt or not to prompt

In Snowflake Cortex, there are task-specific functions that work out of the box without the need to define a prompt. Specifically, teams can quickly and cost-effectively execute tasks such as translation, sentiment analysis and summarization. All that an analyst or any other user familiar with SQL needs to do is point a function, such as the one below, at a table column containing text data and voila! Snowflake Cortex functions take care of the rest, with no manual orchestration, data formatting or infrastructure to manage. This is particularly useful for teams constantly working with product reviews, surveys, call transcripts and other long-text data sources traditionally underutilized within marketing, sales and customer support teams.

SELECT SNOWFLAKE.CORTEX.SUMMARIZE(review_text) FROM reviews_table LIMIT 10;

Of course, there are going to be many use cases where customization via prompts becomes useful. For example:

All of these and more can quickly be accomplished with the power of industry-leading foundation models from Mistral AI (Mistral Large, Mixtral 8x7B, Mistral 7B), Google (Gemma-7b) and Meta (Llama 2 70B). All of these foundation LLMs are accessible via the COMPLETE function, which, just like any other Snowflake Cortex function, can run on a table with multiple rows without any manual orchestration or LLM throughput management.

*Figure 1: Multi-task accuracy of industry-leading LLMs based on the MMLU benchmark.*

SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'mistral-large',
        CONCAT('Summarize this product review in less than 100 words. 
        Put the product name, defect and summary in JSON format: <review>',  
        content, '</review>')
) FROM reviews LIMIT 10;

For use cases such as chatbots on top of documents, it may be costly to put all the documents as context in the prompt. In such a scenario, a different approach may be more cost-effective because it minimizes the volume of tokens going into the LLM (as a general rule of thumb, 75 words equal approximately 100 tokens). A popular framework to solve this problem without having to make changes to the LLM is RAG, which is easy to do in Snowflake.
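To make the token rule of thumb concrete, here is a minimal Python sketch that estimates prompt size from word count. The helper name `estimate_tokens` and the sample sizes are hypothetical, chosen only for illustration; actual token counts depend on each model's tokenizer, so treat the result as a rough estimate.

```python
# Rule of thumb from the text: 75 words are roughly 100 tokens.
WORDS_PER_100_TOKENS = 75


def estimate_tokens(text: str) -> int:
    """Roughly estimate tokens from the whitespace-separated word count."""
    words = len(text.split())
    # Integer ceiling division, so a partial token rounds up.
    return -(-words * 100 // WORDS_PER_100_TOKENS)


# Hypothetical sizes: one 7,500-word document versus three 150-word chunks.
document = "word " * 7500
chunks = ["word " * 150] * 3

full_cost = estimate_tokens(document)
rag_cost = sum(estimate_tokens(c) for c in chunks)
print(f"Whole document: ~{full_cost} tokens; retrieved chunks: ~{rag_cost} tokens")
```

Under these assumed sizes, sending only the retrieved chunks puts far fewer tokens into the prompt than sending the whole document, which is the cost argument for RAG.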

What is RAG?

Let’s go over the basics of RAG before jumping into how to do this in Snowflake.

RAG is a popular framework in which an LLM gets access to a specific knowledge base with the most up-to-date, accurate information available before generating a response. Because there is no need to retrain the model, this extends the capability of any LLM to specific domains in a cost-effective way.

To deploy this retrieval, augmentation and generation framework, teams need a combination of:

Client / app UI: This is where the end user, such as a business decision-maker, is able to interact with the knowledge base, typically in the form of a chat service.
Context repository: This is where relevant data sources are aggregated, governed and continuously updated as needed to provide an up-to-date knowledge repository. This content needs to be inserted into an automated pipeline that chunks (that is, breaks documents into smaller pieces) and embeds the text into a vector store.
Vector search: This requires the combination of a vector store, which maintains the numerical or vector representation of the knowledge base, and semantic search to provide easy retrieval of the chunks most relevant to the question.
LLM inference: Together, these components enable teams to embed the question, retrieve the most relevant context and generate a contextualized response using a conversational LLM.

*Figure 2: Generalized RAG framework from question to contextualized answer.*
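The flow from question to prompt can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how Snowflake implements it: the `embed` function uses simple word counts as a stand-in for a real embedding model, the in-memory list stands in for a vector store, and all function names and sample documents are hypothetical. The final step, sending the prompt to an LLM, is left out.

```python
import math
from collections import Counter


def chunk_text(text: str, max_words: int = 40) -> list[str]:
    """Break a document into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for an embedding model)."""
    return Counter(w.strip(".,?!").lower() for w in text.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, store: list[tuple[str, Counter]], k: int = 1) -> list[str]:
    """Return the k chunks whose vectors are most similar to the question."""
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Augment the question with the retrieved context."""
    context = "\n".join(context_chunks)
    return f"Answer using only this context:\n<context>\n{context}\n</context>\nQuestion: {question}"


# Hypothetical knowledge base.
docs = [
    "The blender warranty covers motor defects for two years.",
    "Returns are accepted within thirty days of purchase.",
]
store = [(c, embed(c)) for d in docs for c in chunk_text(d)]

question = "How long is the warranty on the blender?"
top = retrieve(question, store, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

Only the retrieved chunk, rather than the whole knowledge base, ends up in the prompt that would be passed to the LLM.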

From RAG to rich LLM apps in minutes with Snowflake Cortex

Now that we understand how RAG works in general, how can we apply it to Snowflake? The Snowflake platform offers a rich foundation for data governance and management, which includes the VECTOR data type (in private preview). On that foundation, developing and deploying an end-to-end AI app using RAG is possible without integrations, infrastructure management or data movement using three key features:

*Figure 3: Key Snowflake features needed to build end-to-end RAG in Snowflake.*

Here is how these features map to the key architecture components of a RAG framework:

Client / app UI: Use the out-of-the-box chat elements of Streamlit in Snowflake to quickly build and share user interfaces, all in Python.
Context repository: The knowledge repository can be easily updated and governed using Snowflake stages. Once documents are loaded, all of your data preparation, including generating chunks (smaller, contextually rich blocks of text), can be done with Snowpark. For the chunking in particular, teams can seamlessly use LangChain as part of a Snowpark User Defined Function.
Vector search: Thanks to the native support of VECTOR as a data type in Snowflake, there is no need to integrate and govern a separate store or service. Store VECTOR data in Snowflake tables and execute similarity queries with system-defined similarity functions (L2, cosine, or inner-product distance).
LLM inference: Snowflake Cortex completes the workflow with serverless functions for embedding and text completion inference (using either Mistral AI, Llama or Gemma LLMs).

*Figure 4: End-to-end RAG framework in Snowflake.*
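The three similarity measures named above (L2, cosine and inner product) can rank the same chunks differently when vectors are not normalized. The following Python sketch illustrates only the underlying math; it is not Snowflake's implementation, and the function names and two-dimensional vectors are hypothetical.

```python
import math


def inner_product(a: list[float], b: list[float]) -> float:
    """Sum of element-wise products; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between vectors; ignores their lengths."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


# Hypothetical embeddings: chunk_a points almost the same way as the query
# but is much longer; chunk_b is closer in space but points elsewhere.
query = [1.0, 0.0]
chunks = {"chunk_a": [10.0, 1.0], "chunk_b": [1.0, 1.0]}

closest_by_l2 = min(chunks, key=lambda n: l2_distance(query, chunks[n]))
closest_by_cosine = max(chunks, key=lambda n: cosine_similarity(query, chunks[n]))
closest_by_inner = max(chunks, key=lambda n: inner_product(query, chunks[n]))
print(closest_by_l2, closest_by_cosine, closest_by_inner)
```

In this example, L2 distance prefers `chunk_b`, while cosine similarity and inner product prefer `chunk_a`, so the measure should be chosen with the embedding model's properties in mind.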

Show me the code

Ready to try Snowflake Cortex and its tightly integrated ecosystem of features that enable fast prototyping and agile deployment of AI apps in Snowflake? Get started with one of these resources:

To watch live demos and ask questions of Snowflake Cortex experts, sign up for one of these events:

Want to network with peers and learn from other industry and Snowflake experts about how to use the latest generative AI features? Make sure to join us at Snowflake Data Cloud Summit in San Francisco this June!

The post Easy and Secure LLM Inference and Retrieval Augmented Generation (RAG) Using Snowflake Cortex appeared first on Snowflake.