Building a Better AI Agent with Claude

Listen to this article:

As GenAI continues to evolve at a mind-boggling pace, agents appear to be the next exponential step in the evolution of this emergent technologyI. In the last 18 months, Loka’s engineering squad has embraced AI agents for their productivity-boosting abilities in complex reasoning, data analysis and data visualization. Many options for configuring agents are available, and we’ve experimented with a vast array while working on more than 100 recent active client engagements.

Through trial and error, we’ve determined that AWS cloud-based agents running on Anthropic’s Claude LLMs are right now one of the most flexible, least error-prone options available. We're especially keen on the latest and greatest Claude update, Sonnet, which Anthropic released in late June.

In this post we’ll explain how we came to that conclusion, including the rationale behind our choice of LLM, the role of agents and how we optimized a Claude-based agent.

Cloud Vs Open Source Vs Proprietary Solutions

Most Large Language Models (LLMs) are closed-source, meaning the organizations that create them prevent public access to their model weights, datasets and code. For example, well-known companies like Open AI and Anthropic have built their own proprietary platforms encompassing a web playground and an API. Closed-source platforms offer useful features, but they also present problems. For instance, by focusing on one provider platform you risk locking in against alternatives, putting your users at the mercy of reliability and maintenance of a single source. Privacy and data management are also at risk, as prompts and the data sent with them are sent through an external company’s servers, subject to their policies on user content.

Open models are an alternative. They give full accessibility and flexibility and enable users to deploy the LLM themselves without third-party intervention. They allow data-flow management and easy switching to other options if necessary. However, the flexibility of open-source models requires a lot of hands-on attention, as deployment and infrastructure are no longer delegated to a third party but rather are yours to manage.

Perhaps the sweet spot between these two approaches lies in cloud-based solutions. Cloud providers such as AWS have been pushing LLM creators to release new products to facilitate this new technology on their platforms. Although cloud services don’t offer all existing LLMs, they do provide a good mix of open and closed models and sometimes the cloud provider’s own. They’re flexible about model choice while offering a managed service complete with infrastructure. Plus, cloud-based LLM services exist within the cloud’s environment, which typically allows privacy and granular permission controls, data hosting options and easy access to other services running in the same cloud provider.

Amazon Bedrock

Through a single API, AWS Bedrock provides a fully managed, serverless, cloud-based generative AI service, with offerings from Amazon itself like Anthropic, Mistral, Meta, Cohere, Stability AI and others. Additionally, Bedrock encrypts all its data and complies with several standards, including GDPR and HIPAA; the latter is particularly relevant in the life sciences industry. Being an AWS service, Bedrock provides custom access to endpoints, adding an additional layer of security and addressing concerns about privacy and security. It brings additional features such as provisioned throughput for lower latency and higher rate limits, as well as fine tuning models and hosting them privately on AWS, but that isn’t going to be discussed in this post.

All in all, the perks of Amazon Bedrock make for a strong option versus a closed platform such as OpenAI or an open model.

Anthropic Claude

‍Amazon Bedrock offers access to Anthropic’s LLMs. Anthropic, an established spinoff of OpenAI, is one of the leaders in generative AI. In particular, the Claude model family is recognized as a top-performing LLM with advanced reasoning capabilities, relatively low rates of hallucination and one of the largest context windows available (200K tokens, soon beyond 1M). This window size means that on a single prompt, Claude can read around 150K words or 500 pages of text. Large prompts bring their own concerns, including higher latency, higher costs and potential loss of information, but it’s very useful to have the capability to provide a lot of contextual information. This information can come in the form of large documents for analysis, guidelines the LLM should follow, examples of expected behavior to mold the LLM, previous messages of a chat or thought process, tools available for an agent to use, etc.

Many versions of Claude are available. Claude Opus is probably the best one to use for agents, but cost and speed might make us prefer other options.

Speaking of agents and thought processes, let’s look at exactly what an agent is.

Agents

LLMs queried in simple prompts can provide interesting insights and some reasoning capabilities based on the large and diverse datasets they were trained on. However, they’re prone to hallucinations, i.e. generating incorrect information, and don't have access to updated data on their own. These issues are particularly salient when dealing with problems that require complex reasoning, domain specific knowledge, mathematical calculations or coding; essentially, tasks that even humans rarely get right without the help of external sources. Thus, in order for LLMs to be able to tackle a wide variety of tasks, they need the ability to interact with data sources and tools that complement their capabilities in an iterative, planned process. This is where agents come in.

Giving LLMs access to tools means that an LLM can decide to use a tool that is relevant for the task at hand. These can be external data sources, programmatic functions, LLM prompts or even other agents. For example, when faced with a mathematical operation, it can use a calculator tool instead of trying to guess the answer based on its pre-existing training.

Agent types vary, but most involve multiple steps of actions (e.g. use a specific tool) and observations that result from each action. This way, agents can revise their plans, correct potential faults in their logic and deconstruct complex queries into smaller, more manageable subtasks. How the agent plans what to do is an evolving field.

After successfully completing its path, the agent presents the final answer to the query. Usually this comes in the form of text and might not reveal much about the reasoning behind it. However, as illustrated in the diagram above, the agent can be adjusted so that it outputs additional content, including interactive plots, tables, tool outputs and results of code written by the agent.

Building a Claude-Based Agent

‍As in any machine learning model, LLMs are best used in a way similar to how they were trained. For instance, Claude LLMs have been fine-tuned on XML formatted text, a type of structure that is common in web development, characterized by an hierarchical labeled structure. This means that performance improves if prompts are organized into XML tags.

Anthropic Tools

‍While AI agents are still a fairly new and maturing concept, Python libraries such as langchain and llamaindex are already established leaders. Both tend to focus on OpenAI’s GPT models first, open source second and eventually other providers such as Anthropic. Because Anthropic’s Claude is optimized for XML, which is quite distinct from other LLM prompting strategies, these libraries aren’t ideal for developing Claude-based agents.

Fortunately, Anthropic started its own GitHub repository for building agents with Claude: anthropic-tools. As of April 2024, this package is still in alpha. Loka has been using it with forked versions of its own, with some changes to improve stability and adapt to clients’ use cases. It’s new, but anthropic-tools surpasses other libraries when developing agents with Claude. It’s a product Anthropic itself made from the ground up to use XML, resulting in very capable agents that rarely face issues with their iterations.

Developing Tools with Claude

‍Agents benefit from having a planning module. Similar to approaches like chain of thought prompting, planning modules help break down complex problems into smaller, iterative steps before declaring a final answer. Anthropic-tools don’t have a built-in planning module, but it does allow for any kind of tool to be implemented. A Claude-based planner tool for the agent can be particularly useful in addressing queries that require multiple logical steps. In order to get it working optimally, consider the following guidelines:

Give as much context as possible to the planner tool’s prompt.
Pass in the descriptions of the agent’s other tools.
Provide examples of plans that it should define for some common types of queries.
Emphasize its importance in the tool’s description.
Incentivize the agent to reuse the planner if it gets stuck or something fails in its current plan.

Note that while the plan can be refined, this approach doesn’t guarantee usage of the planning module and is a relatively rudimentary approach to planning. Alternative methods such as Plan-and-Execute allow for more rethinking of the plan based on observed outputs, as well as ways to accelerate inference via parallel task execution like LLMCompiler. However, these methods would require more profound changes to anthropic-tools or a better integration of Claude in libraries such as langchain and llamaindex.

Feature Selection

‍Agents can be used for data analysis, assisting with identifying patterns, providing insights, contextualizing the dataset and generating data visualizations. For most of these tasks, the data needs to be filtered down to the features that are most relevant for the query at hand. An LLM might struggle to identify which columns to use, if just given a table, particularly if the dataset is large. At Loka, we noticed that developing a tool for selecting features based on their names, descriptions and categories significantly improves feature retrieval. More specifically, the use case involved:

Writing a description for each feature in vocabulary that is detailed, clearly distinguishes columns and is intuitive for someone without much domain knowledge (i.e. the LLM)
Assigning categories for each feature, depending on the use cases; at a recent Loka project, features were separated into measurements from lab instruments, sample identifiers and time-related
Developing a feature selection tool, using anthropic-tools, which received as inputs the user’s query and the category from which to select feature(s); based on relevant context (e.g. an experiment’s document) alongside the names and descriptions of the variables that belonged to the selected category, the tool gives back structured output with a list of selected feature names and the reasoning behind it

Say for instance your goal is to create a plot. With this tool, instead of having the agent try to guess the correct column names and inspect the raw data when it fails, it can use feature selection to more reliably identify the relevant variables for each axis of the plot.

Python Code Execution

‍Unchecked LLM output text might contain incorrect information. One way to avoid bad guesses is to ask the LLM to write code. If the code is correct, it should be more reliable as it performs the same kind of calculations as a data analyst would do. However, LLM-generated code can also be flawed, either failing to run or giving unintended results. So, within a safe sandbox (far away from nuclear codes), the LLM can run its code and iterate on it if there are any issues, just like a programmer would do.

Room for Improvement

‍Agents continue to have their limitations, potentially going off rails of the intended prompted behavior, hallucinating fake information and failing to tackle very complex problems. Other areas we see room for improvement include

Reduced latency. An agent with Claude, with large prompts, on Amazon Bedrock, over multiple steps can take up to a few minutes to answer.
Better planning modules. For example, Plan-and-Execute, parallel task execution.
Automatic output evaluation on a subset of queries.
LLMs that are fine tuned for specific tasks.

That said, at Loka we believe that this technology will continue to grow. We’ve worked with numerous clients to bring agents to production. With the right tools, best practices and well-defined scopes, agents are already proving their worth.

Artificial Intelligence

July 24, 2024

Loka Staff

André Ferreira

Machine Learning Engineer at Loka

AND

Telmo Felgueira

Machine Learning Engineering Manager

AND

Building a Better AI Agent with Claude

Listen to this article: