Note: This article was written for a mostly non-technical audience. While researching AI vendor partners with some clients, we wanted to demystify how LLMs are actually used in production, without using too much jargon.
State-of-the-art LLMs are trained on huge datasets, often described as “the whole internet”. This gives them broad and deep expertise when answering questions. However, any information that wasn’t in the training dataset is not “known” to the LLM. This poses a problem if we want the LLM to answer questions about a specific dataset. So how do we teach an LLM new things?
We could attempt to build a brand new LLM on just the data for our problem domain. However, a domain-specific dataset is rarely big enough to train an LLM with the advanced reasoning capabilities of a ChatGPT-level model. And even if we did have enough data, training a foundational model from scratch is very expensive.
What if we could teach an already smart foundational model to do a more specific task, like answering customer service questions? Fine-tuning allows us to take a foundational (open source) model, like Qwen or Llama, and additionally train it on domain-specific data.
For example, to train an LLM to be a customer service chatbot, we could prepare a dataset of customer service questions and known answers, then fine-tune the base model with that additional data. This should result in a model that is tuned to respond like a reasonably smart customer service rep.
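To make this concrete, here is a minimal sketch of what a fine-tuning dataset might look like. The exact file format varies by provider and toolkit; many accept question/answer pairs as JSON Lines (one JSON object per line), which is what this hypothetical example produces:

```python
import json

# A few hypothetical question/answer pairs from our customer service history.
# Real fine-tuning datasets usually contain thousands of examples.
examples = [
    {
        "prompt": "How do I pair my bed base remote?",
        "completion": "Hold the pairing button on the remote for five seconds until the light flashes...",
    },
    {
        "prompt": "What is your return policy?",
        "completion": "You can return any product within 30 days of delivery for a full refund...",
    },
]

# Write the dataset as JSON Lines: one JSON object per line.
with open("fine_tune_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```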
This is still fairly expensive to do, as it requires GPU compute time to fine-tune a model. Results will also vary depending on the size of the training dataset and the quality of the questions and answers. And because the base model knows about much more than the fine-tuning dataset, it can answer questions well outside the scope of its task. For example, with some nefarious prompt engineering, a user may be able to get the chatbot to generate instructions on how to build a bomb. 😬
For some tasks, a foundational model may be good enough so long as all the relevant information is in the incoming prompt.
For example, say we want to build a customer service chatbot that answers questions about a specific product, and say we have a 10-page customer service manual for that product. We could just stick the entire text of the manual into the prompt before the customer’s query. This is often called prompt stuffing. We can also instruct the LLM on how to respond, for example: “Respond in the manner of a polite and helpful customer service agent. Use a warm and professional tone.” This is often called prompt engineering.
The format of the prompt we send to the LLM may look like this:
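```
Respond in the manner of a polite and helpful customer service agent.
Use a warm and professional tone. Answer the customer's question using
only the product manual below.

--- PRODUCT MANUAL ---
<full text of the 10 page customer service manual pasted here>

--- CUSTOMER QUESTION ---
How do I use my bed base remote?
```

(In practice, most chat APIs split this into a “system” message holding the instructions and the manual, and a “user” message holding the customer’s question, but the idea is the same.)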
The costs for this approach are much smaller (in some cases free) compared to training. It has the same potential risk as a fine-tuned model, though: clever prompt engineering can still get the chatbot to go “off script”.
What if our relevant data is too big to fit into a prompt? Say we are building a customer service chatbot not for just one product, but for a brand’s entire product catalog. Let’s say the customer service resources are available on a searchable web portal. We could use this search portal to provide relevant search results to help generate specific answers to a customer question. Here’s how that might work:
```mermaid
sequenceDiagram
    participant User
    participant LLM as AI Assistant
    participant ToolUse as Tool Use middleware
    participant Search as Search API
    participant DocumentAPI
    User->>LLM: "How do I use my bed base remote?"
    LLM->>ToolUse: LLM generates a request in a special format that Tool Use middleware recognizes as a request for search results
    ToolUse->>Search: Tool Use middleware queries search API<br/>"Bed Base Remote User Manual"
    Search->>ToolUse: Search results for multiple remotes
    ToolUse->>LLM: Returns a prompt asking the user to select their product (from search results)
    LLM->>User: Which remote do you have? (list results)
    User->>LLM: Answer
    LLM->>ToolUse: LLM generates a request in a special format that Tool Use middleware recognizes as a request for a specific document
    ToolUse->>DocumentAPI: Get text of relevant user manual
    DocumentAPI->>ToolUse: User manual as text
    ToolUse->>LLM: Tool Use middleware returns the retrieved manual text as additional context for the LLM to use
    LLM->>User: Respond with specific help pulled from the additional context, and provide a link to the document
```
This approach requires some kind of Tool Use middleware that can look at LLM output and determine if/when the LLM is “asking” to search or retrieve a document. It also assumes that the relevant data is available and queryable; if it isn’t, we’d have to build such a tool first. This is basically an extension of prompt stuffing, where we add relevant context into the chat context. Note: neither this nor prompt stuffing will “teach” the LLM anything new. The LLM only “knows” what has been added to the chat context. Once the chat is finished, it “forgets” any previous chats.
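To make the “Tool Use middleware” idea more concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that we have instructed the LLM to write a line like `SEARCH: <query>` whenever it wants search results; the function names and the search API here are made up.

```python
import re

# Purely illustrative: assume we've instructed the LLM to emit a line like
# "SEARCH: bed base remote user manual" whenever it wants search results.
SEARCH_REQUEST = re.compile(r"^SEARCH:\s*(.+)$", re.MULTILINE)

def handle_llm_output(llm_output, search_api, chat_context):
    """If the LLM output contains a search request, run the search and
    append the results to the chat context for the LLM's next turn."""
    match = SEARCH_REQUEST.search(llm_output)
    if match is None:
        return False  # No tool request; just show the output to the user.

    query = match.group(1)
    results = search_api(query)  # e.g. our product portal's search endpoint

    # Stuff the results back into the chat context so the LLM can use them
    # when it generates its next response.
    chat_context.append(f"Search results for '{query}':\n{results}")
    return True

# A stand-in for a real search API, just for this sketch.
def fake_search_api(query):
    return "1. Bed Base Remote User Manual\n2. Legacy Bed Base Remote Manual"

context = []
handle_llm_output("SEARCH: bed base remote user manual", fake_search_api, context)
print(context[0])
```

Real systems use more structured formats than a regex over the output (see the next section), but the loop is the same: detect the request, call the external system, and put the result back into the chat context.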
At their core, LLMs take input data and return output data. For chatbots, users enter text and the chatbot outputs text. Given this limitation, how do we get LLM-powered tools to actually do something other than spit out more text?
Tool use (often called “function calling”) allows AI models to interact with external systems and perform actions beyond text generation. The Retrieval Augmented Generation (RAG) section above is an example of tool use: we gave the LLM the ability to 1. search for data and 2. retrieve documents. How it works: the LLM emits a request in a structured format, middleware recognizes that format and calls the external system (such as a search API or document store), and the result is fed back into the chat context for the LLM to use in its next response.
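The exact wire format varies by provider, but a “function call” from the model typically amounts to a structured request naming a tool and its arguments. Here is a rough, made-up sketch of what handling one might look like; the tool name, arguments, and stub implementation are all hypothetical:

```python
# Stand-in for a real search API call (made up for illustration).
def search_product_manuals(query):
    return f"Top results for '{query}': Bed Base Remote User Manual, ..."

# What a "function call" from the model might look like, represented as a
# Python dict. Real providers return something similar in JSON.
tool_call = {
    "name": "search_product_manuals",
    "arguments": {"query": "bed base remote user manual"},
}

# Our code runs the requested tool, then sends the result back to the model
# as a new message so it can write its final answer using that information.
result = search_product_manuals(**tool_call["arguments"])
print(result)
```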
Chatbots output text, but that text isn’t limited to English prose. For example, chatbots can output code snippets. Often, chatbots will use the standard Markdown Code Block syntax to render code snippets like so:
print("This is some example code")
Because this format cleanly delineates code from prose, we could parse the LLM output to look for code snippets and stuff that code into a web-based IDE tool, giving the user the option to interact with that code. We could also (with extreme caution) execute code written by the LLM.
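As a rough sketch of what “parse the LLM output to look for code snippets” could mean, here is one way to pull fenced code blocks out of a made-up chatbot response; the response text and regex are illustrative only:

```python
import re

FENCE = chr(96) * 3  # three backticks: the Markdown code fence marker

# A made-up chatbot response that contains a fenced code block.
llm_response = (
    "Here is a small example:\n\n"
    + FENCE + "python\n"
    + 'print("This is some example code")\n'
    + FENCE + "\n\n"
    + "Let me know if you'd like any changes."
)

# Matches fenced code blocks: an opening fence with an optional language
# tag, the code itself, then a closing fence.
CODE_BLOCK = re.compile(FENCE + r"(\w*)\n(.*?)" + FENCE, re.DOTALL)

for language, code in CODE_BLOCK.findall(llm_response):
    print(f"Found a {language or 'plain'} code snippet:")
    print(code)
```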
For example, our LLM could also generate LaTeX syntax to display mathematical notation:
```latex
\begin{align}
E_0 &= mc^2 \\
E &= \frac{mc^2}{\sqrt{1-\frac{v^2}{c^2}}}
\end{align}
```

which renders as:
\[
\begin{align}
E_0 &= mc^2 \\
E &= \frac{mc^2}{\sqrt{1-\frac{v^2}{c^2}}}
\end{align}
\]

We could ask the LLM to use Mermaid Diagram Syntax to render diagrams:
```
graph LR
    A[Square Rect] -- Link text --> B((Circle))
    A --> C(Round Rect)
    B --> D{Rhombus}
    C --> D
```

which renders as:
```mermaid
graph LR
    A[Square Rect] -- Link text --> B((Circle))
    A --> C(Round Rect)
    B --> D{Rhombus}
    C --> D
```
We could also ask the LLM to generate tabular data in the CSV format, which we can then display as a table and/or import into a spreadsheet application:
```csv
product,unit cost,unit profit
super cola, 0.99, 0.90
diet super cola, 0.99, 0.95
```

which renders as:
| product | unit cost | unit profit |
|---|---|---|
| super cola | 0.99 | 0.90 |
| diet super cola | 0.99 | 0.95 |
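And if we want to do more than display it, a few lines of Python can turn the model’s CSV text back into structured rows (using the example output above):

```python
import csv
import io

# The CSV text produced by the LLM (copied from the example above).
llm_csv = """product,unit cost,unit profit
super cola, 0.99, 0.90
diet super cola, 0.99, 0.95
"""

# Parse it into dictionaries keyed by the header row
# (skipinitialspace tolerates the spaces after commas).
rows = list(csv.DictReader(io.StringIO(llm_csv), skipinitialspace=True))

for row in rows:
    print(row["product"], row["unit cost"], row["unit profit"])
```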
What all the above examples have in common: we ask the LLM for a text format, but use that text to render something more than text. The lesson here is that an LLM can generate any text-based data format, though your results may vary. One more note: I’ve found that LLMs generally produce decent Mermaid diagrams, but struggle to interpret Mermaid diagrams placed into the prompt. Again, this will vary depending on the model’s training data.