Exploring AI Agents: A Journey with 'llmstatemachine'
In this article, we will explore the implementation of generative AI agents, examining the challenges and solutions involved in navigating and interacting with dynamic digital environments.
November 30, 2023 – Mikko Korpela, Principal Engineer, Robocorp
“An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators.” - Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, 2nd ed. (Pearson Education, Inc., 2003), p. 32.
Building on this definition, in this post, we'll dive into the application of generative AI agents, closely examining the challenges and solutions encountered in my project, llmstatemachine, which explores ways to navigate and interact within dynamic digital environments.
My Path to AI
Having spent 18 years in the software industry, my relationship with AI has evolved from viewing it as a fascinating theoretical concept to recognizing its potential for real-world applications. This journey led me to explore the intersection of AI and problem-solving.
At the beginning of 2022, I was exposed to GitHub Copilot for the first time. With a group of friends, I wrote an example test case whose data was about Tony Stark, the fictional Marvel Universe character whose alter ego is Iron Man. GitHub Copilot suggested another test case with data about Bruce Banner and his alter ego, the Hulk. I was amazed.
Facing the Challenges
Despite rapid advances in AI, an agent's ability to follow a plan remains limited: GenAI-based agents still show high error rates, highlighting how hard it is for them to maintain focus and purpose over a multi-step task.
One potential way to improve the task completion rate is to restrict the agent's actions and require it to move through a defined set of steps toward its goal. This led me to finite state machines, which can represent a workflow as states with a limited set of allowed actions in each. I began exploring this concept in my llmstatemachine project.
I focused on cases where an agent may not observe the whole environment. Examples of these partially observable environments are all around.
HTML Element Locator Agent
The first agent problem I focused on was HTML element locator discovery. This is a specific problem for web browser automation. When searching for dynamically changing information from a web page, locating the elements containing the target information can be difficult.
In this context, the environment the agent works in is the HTML DOM: an element tree containing the target information. The goal is to find a method, specifically a CSS selector, that selects the branches of the tree holding the target information. The same selector can then be re-used later to read changed information from the exact same positions. Building a robust selector that matches all targets and keeps working over time can be challenging.
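As a minimal illustration (the page and selector here are hypothetical, not taken from the project), once a selector such as `div.price > span` has been found, it can be stored and re-run against newer versions of the same page:

```python
from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

# A tiny stand-in for a page whose prices change between visits.
html = """
<div class="product"><div class="price"><span>19,90</span></div></div>
<div class="product"><div class="price"><span>4,50</span></div></div>
"""

soup = BeautifulSoup(html, "html.parser")
# The selector is the reusable artifact: run it again later on a fresh
# copy of the page to read the values from the exact same positions.
prices = [el.get_text() for el in soup.select("div.price > span")]
print(prices)  # ['19,90', '4,50']
```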
I built functions that let an agent observe some representation of the HTML DOM. I'll call a function that produces such an observation a view.
A View:
- a partial representation of the environment.
- does not modify the environment.
- is controllable by the agent.
For the HTML DOM, I implemented three views: focus, select, and validate.
The focus view returns a response of a standard size, regardless of how many tokens the whole HTML DOM would take. It accepts a text parameter that it tries to highlight in the DOM, opening the relevant element structures around the matches.
The select view takes a CSS selector as an argument and returns a standard-size, paged list of the first matching elements from the HTML DOM in a simplified format.
The validate view works internally like select: it takes a CSS selector and runs it. But instead of returning the matched elements to the agent, it checks whether they contain all of the example texts and reports success or failure back to the agent.
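To make the validate view concrete, here is a rough sketch of its logic, assuming BeautifulSoup for DOM access. The function name matches the description above, but the code is illustrative rather than the project's actual implementation:

```python
from bs4 import BeautifulSoup

def validate(dom: BeautifulSoup, css_selector: str, example_texts: list[str]) -> str:
    """Test a CSS selector: succeed only if the matched elements cover all example texts."""
    matched = dom.select(css_selector)
    if not matched:
        return f"Failure: selector {css_selector!r} matched no elements."
    matched_text = " ".join(el.get_text(" ", strip=True) for el in matched)
    missing = [text for text in example_texts if text not in matched_text]
    if missing:
        return f"Failure: selector {css_selector!r} does not cover: {missing}"
    return f"Success: selector {css_selector!r} covers all example texts."
```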
The experimental agent runs were promising, although the agent sometimes consumed the entire context window or got stuck in a loop when the focus action failed to retrieve a relevant view.
Most of the time the agent came up with a good CSS selector for finding the set of elements in the HTML DOM. One noteworthy aspect of this problem is that there is not just one working solution; multiple selectors can satisfy the requirements.
A complete example of agent code can be found here.
A high-level overview of the state machine:
```python
builder = WorkflowAgentBuilder()
builder.add_system_message(
    (
        "You are a helpful HTML css selector finding assistant.\n"
        "Assignment: Create CSS Selectors Based on Text Content.\n"
        "Your task is to develop CSS selectors that can target HTML elements containing specific text contents. "
        "You are provided with a list of example texts. Use these examples to create selectors that can identify "
        "elements containing these texts in a given HTML structure.\n\n"
        "Instructions:\n"
        f"- Use the provided list of examples: {examples_str}.\n"
        "Your goal is to create selectors that are both precise and efficient, tailored to the specific"
        " content and structure of the HTML elements."
    )
)
builder.add_state_and_transitions("INIT", {focus, select})
builder.add_state_and_transitions("SELECTED_NON_EMPTY", {focus, select, validate})
builder.add_state_and_transitions("VALIDATED", {focus, select, validate, result})
builder.add_end_state("DONE")
workflow_agent = builder.build()
```
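Note how the transitions narrow the agent's options: from INIT it can only observe with focus or select, validate becomes available only once a selection is non-empty, and result is offered only in the VALIDATED state, so the agent has to test a selector before it can report one.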
Coming up with the general design of a state-machine-based LLM agent
Most of today's popular Large Language Models (LLMs) are built around chat interfaces, where the LLM plays one side of the conversation and a user or a function plays the other in the chat messages. llmstatemachine uses this chat history as the agent's memory.
In llmstatemachine a state can enable and disable functions that the LLM can call. The state defines the transitions that the agent can take. The transition functions are responsible for changing the state.
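Conceptually, a transition function performs its action, produces a message for the chat history, and names the state the agent lands in next. The sketch below is only illustrative of that shape, not the library's exact API; `dom` and `render_simplified` are hypothetical helpers:

```python
def select(css_selector: str):
    """Transition: run the selector and decide which state the agent moves to."""
    matches = dom.select(css_selector)
    if not matches:
        return f"No elements matched {css_selector!r}.", "INIT"
    # A fixed-size, simplified rendering keeps the chat history small.
    return render_simplified(matches[:10]), "SELECTED_NON_EMPTY"
```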
Under the hood, this is accomplished with OpenAI function calling, with the call enforced on every turn. The current API only lets you force one specific function to be called, so I made that function an ActionSelector whose arguments name the actual transition function to call and its parameters. The function calling API can be seen as a simple way to force the model to return JSON output that follows a schema (stricter than JSON mode, which does not enforce a schema).
The ActionSelector also takes a parameter called thinking. It lets the LLM reason in tokens before making the actual decision on which transition to take, and because the thinking is recorded in the message history, the agent can later refer back to the reasoning it has already done.
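As a sketch of how that enforcement can be done with the OpenAI Python SDK (the schema below mirrors the description above but is illustrative, not the library's exact definition):

```python
from openai import OpenAI  # assumes the v1 OpenAI Python SDK

client = OpenAI()

# A single schema-enforced "meta" function: the model is forced to call it,
# and its arguments name the actual transition function plus its parameters.
action_selector_tool = {
    "type": "function",
    "function": {
        "name": "ActionSelector",
        "description": "Choose the next transition of the state machine.",
        "parameters": {
            "type": "object",
            "properties": {
                "thinking": {
                    "type": "string",
                    "description": "Reason in tokens before deciding.",
                },
                "function": {
                    "type": "string",
                    "enum": ["focus", "select", "validate"],  # the current state's transitions
                },
                "args": {
                    "type": "string",
                    "description": "JSON-encoded arguments for the chosen function.",
                },
            },
            "required": ["thinking", "function", "args"],
        },
    },
}

chat_history = [{"role": "user", "content": "Find a selector for the example texts."}]

response = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=chat_history,  # the agent's memory
    tools=[action_selector_tool],
    # Forcing tool_choice guarantees a JSON response that follows the schema.
    tool_choice={"type": "function", "function": {"name": "ActionSelector"}},
)
call = response.choices[0].message.tool_calls[0].function
print(call.name, call.arguments)
```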
To create the descriptions in the JSON specification of a function, I made a convenience method that fills them in with the help of an LLM. This worked very well and could become a project of its own.
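A sketch of that idea: send the function's source to the model and ask for parameter descriptions back as JSON. The helper name and prompt here are mine, not the project's:

```python
import inspect
from openai import OpenAI

client = OpenAI()

def describe_parameters(fn) -> str:
    """Ask the LLM to draft descriptions for a function's parameters as JSON."""
    prompt = (
        "For each parameter of this Python function, write a short description "
        "suitable for a JSON schema. Answer as a JSON object mapping parameter "
        "names to descriptions.\n\n" + inspect.getsource(fn)
    )
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # plain JSON mode is enough here
    )
    return response.choices[0].message.content
```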
Maze Game Agent
The maze game agent was built without giving the agent access to a complete map of the 2D maze it was exploring. In my tests its success rate was low: it kept going in loops or walking the same paths again and again. Here is an example step from one of the runs:
```
thinking: Given that I just moved from the right, and since generally the
objective is to move towards the upper left corner of the maze, my next move
should be upwards to get closer to the goal.
action: move_up
..
Stopped: at a cross-section
From current location you may move: DOWN, LEFT, RIGHT
[ASCII rendering of the maze grid, with S marking the start, E the exit,
and X the agent's current position]
```
I learned that when the same LLM (GPT-4-turbo) is asked to create an agent or an algorithm that solves a 2D maze, it can do so easily. To me this is a really interesting observation. I would like to explore more patterns where LLMs act as system builders rather than directly serving as the operational agents inside those systems.
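For example, the kind of deterministic solver it can write on request is a short breadth-first search. This sketch assumes the maze is given as a list of equal-length strings with '#' walls, 'S' as the start, and 'E' as the exit:

```python
from collections import deque

def solve_maze(grid: list[str]) -> list[tuple[int, int]] | None:
    """Breadth-first search from 'S' to 'E'; returns the path as grid coordinates."""
    rows, cols = len(grid), len(grid[0])
    start = next((r, c) for r in range(rows) for c in range(cols) if grid[r][c] == "S")
    queue = deque([start])
    came_from: dict[tuple[int, int], tuple[int, int] | None] = {start: None}
    while queue:
        r, c = queue.popleft()
        if grid[r][c] == "E":
            # Walk back through the predecessors to reconstruct the path.
            path, node = [], (r, c)
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#" and (nr, nc) not in came_from:
                came_from[(nr, nc)] = (r, c)
                queue.append((nr, nc))
    return None  # no path between S and E
```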
The complete maze example agent code can be found here. The states are dynamically generated based on available next directions.
```python
maze_game_agent_builder = (
    WorkflowAgentBuilder()
    .add_system_message(
        f"You are a player in a 2 dimensional {maze_height*2}x{maze_width*2} maze. "
        + "Find your way through the maze."
    )
    .add_end_state("DONE")
)


def match_directions_to_callables(direction_combination) -> set[Callable]:
    callables = set()
    direction_to_function = {
        "UP": move_up,
        "DOWN": move_down,
        "LEFT": move_left,
        "RIGHT": move_right,
    }
    for direction in direction_combination:
        if direction in direction_to_function:
            callables.add(direction_to_function[direction])
    return callables


for state in player.all_direction_combinations():
    state_str = ":".join(sorted(state))
    maze_game_agent_builder.add_state_and_transitions(
        state_str, match_directions_to_callables(state)
    )

maze_game_agent_builder.add_state_and_transitions("INIT", {start})
memory_game_agent = maze_game_agent_builder.build()
```
Conclusions
In summary, GPT-4-turbo, as the current forefront of publicly accessible generative AI, shows promise as a core engine for agent-based systems within a state machine framework. However, it encounters challenges with complex, high-level planning and execution.
What's fascinating is its ability to craft algorithmic agents capable of solving puzzles it can't directly tackle. Imagine a scenario where GPT-4-turbo designs an agent that easily navigates a labyrinth that GPT-4-turbo itself struggles to get through step by step. This isn't just about an AI model's knowledge of algorithms; it hints at a new frontier where generative AI can be harnessed to build deterministic agents, complementing its inherently non-deterministic nature. Such a future, where AI not only solves problems but creates solvers, is not just promising: it's a glimpse into the potential of AI in our world.