I’ve been working on autobot, an automated code refactoring tool (“It’s like GitHub Copilot, but for your existing codebase”).
autobot takes an example diff as input, scans your codebase, and generates patches for you to review, where each patch represents applying the logic behind that example diff to an existing function, class definition, etc. There are lots of examples in the thread.
autobot is powered by large language models (LLMs); specifically, the text-davinci-002 model from OpenAI. This was my first time building an application atop LLMs. Here are a few observations and questions based on my experience.
1. It’s very easy to get up-and-running with LLMs via managed services.
I mostly used the OpenAI API, but I also tried out Huggingface’s Accelerated Inference API (and some of my friends use Replicate). You really don’t need to know anything about ML to build atop these services.
As an example, here’s the OpenAI code needed to generate a Python function based on a description:
import os

import openai

openai.organization = os.environ["OPENAI_ORGANIZATION"]
openai.api_key = os.environ["OPENAI_API_KEY"]

# Print every completion the model returns for the prompt.
for choice in openai.Completion.create(
    model="text-davinci-002",
    prompt="# Write a function to compute the square of two numbers.",
    temperature=0,
)["choices"]:
    print(choice["text"])
Similarly, here’s Huggingface code for the same thing (using CodeParrot, a code generation model):
import json
import os

import requests

API_URL = "https://api-inference.huggingface.co/models/codeparrot/codeparrot"
headers = {"Authorization": f"Bearer {os.environ['HUGGINGFACE_TOKEN']}"}

def query(payload):
    data = json.dumps(payload)
    response = requests.request("POST", API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8"))

for response in query("# Write a function to compute the square of two numbers."):
    print(response["generated_text"])
OpenAI and Huggingface both provide free tiers and credits with plenty of room to play around. I used up $3.57 (out of $18.00 in credits) while building autobot.
Conversely, getting these models to run locally on my M1 was a no-go, and setting up an inference server on GCP felt like a big ask, so I’ll probably use managed services for inference for the foreseeable future. The biggest limitation I experienced is that they’re just not that fast. “Fast” is relative, of course (running these models at scale is an incredible feat of engineering), but the kinds of queries that autobot performs atop text-davinci-002 take 5-10 seconds.
In building autobot, I was reminded of Replit’s write-up on GhostWriter. The main reason I’d see myself moving away from managed services is to optimize inference speed for a specific model in a specific context.
2. The “modeling” isn’t the bottleneck.
Most of building autobot had nothing to do with large language models or machine learning specifically. Instead, it was about taking those capabilities and packaging them up in a useful user interface and user experience.
I spent more time on the patch review workflow than I did on hooking into any of those inference APIs or testing out model variants. I spent more time figuring out how to break the user’s code into useful, editable chunks than I did on tweaking any of the prompts.
I suspect this will only become truer as LLMs advance and evolve.
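To give a flavor of that non-ML work, here’s a rough sketch (my own illustration, not autobot’s actual implementation) of how you might break a Python file into function- and class-level chunks using the standard library’s ast module, skipping anything too large to send in a single prompt:
import ast

# Sketch: split a Python source file into top-level function and class chunks
# that can each be sent to the model independently. The size cutoff is an
# arbitrary stand-in for the prompt token limit discussed under Limitations.
MAX_CHARS = 2_000


def extract_chunks(source: str) -> list[str]:
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunk = ast.get_source_segment(source, node)
            if chunk is not None and len(chunk) <= MAX_CHARS:
                chunks.append(chunk)
    return chunks
The interesting decisions are all around the edges: what counts as a chunk, and how to turn the model’s rewrite of a chunk back into a reviewable patch.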
3. What will testing look like for LLM-powered applications?
autobot differs from some of the other LLM-powered applications I’ve seen in that, for each prompt, it’s looking to generate a single, deterministic response (temperature=0). And in many of the contexts I’ve explored, there’s really only one right answer — e.g., if the task is “remove print statements”, then the answer should contain the input code, without any print statements.
Test-driven development actually works well for this: you know the user input, you know the desired output, and what you’re trying to define is the right prompt and associated context to coax it out of the model.
I ended up writing a test suite that encoded these expectations and let me iterate on the prompt from there:
import unittest

# Note: `make_prompt` and `api.create_completion` are autobot's own helpers
# (their imports are omitted here).


class AutobotTest(unittest.TestCase):
    def test_useless_object_inheritance(self) -> None:
        before_snippet = """
class Foo(Bar, object):
    def __init__(self, x: int) -> None:
        self.x = x
"""
        after_snippet = """
class Foo(Bar):
    def __init__(self, x: int) -> None:
        self.x = x
"""
        snippet = """
class CreateTaskResponse(object):
    task_id: str
"""
        prompt = make_prompt(
            snippet,
            transform_type="ClassDef",
            before_snippet=before_snippet,
            after_snippet=after_snippet,
            before_description="with object inheritance",
            after_description="without object inheritance",
        )
        actual = api.create_completion(
            prompt.text, max_tokens=prompt.max_tokens, stop=prompt.stop
        )
        expected = """
class CreateTaskResponse:
    task_id: str
"""
        self.assertEqual(actual.strip(), expected.strip())
Unfortunately, your tests now rely on an external service, and one that you don’t want to mock out. Even so, I’d generally recommend this approach: it added some structure to a process that otherwise felt haphazard (tweak the prompt to get one case passing; move on to the next; tweak the prompt to get that case passing, but break the first case in the process; and so on).
This strategy only works for deterministic outputs. I’m not sure what I would do if I were building something atop a non-deterministic language model and/or working in a context that allowed for multiple “right answers”.
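One mitigation (a sketch under assumptions, not what autobot’s test suite actually does, though a simple caching layer for repeated queries comes up again under Limitations below): since the completions are deterministic, you can memoize them on disk, keyed by a hash of the prompt and parameters, so that re-running the suite only hits the API for prompts that have changed.
import hashlib
import json
import os
from pathlib import Path

import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# Disk-backed memoization for deterministic (temperature=0) completions,
# keyed by a hash of the prompt plus any parameters that affect the output.
CACHE_DIR = Path(".completion_cache")


def cached_completion(prompt: str, **params) -> str:
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(
        json.dumps([prompt, params], sort_keys=True).encode()
    ).hexdigest()
    path = CACHE_DIR / f"{key}.txt"
    if path.exists():
        return path.read_text()
    # Cache miss: call the real API and persist the result.
    text = openai.Completion.create(
        model="text-davinci-002", prompt=prompt, temperature=0, **params
    )["choices"][0]["text"]
    path.write_text(text)
    return text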
4. Different language models behave very differently.
Sorry if this is obvious, but based on what I’d seen on Twitter, I’d kind of assumed that all of these “large” models would be more or less flawless, or at least interchangeable.
For autobot, I had to use the largest OpenAI model (text-davinci-002) to get good results. The smaller models (like text-curie-001) didn’t really work. E.g., in the above case (removing object inheritance from CreateTaskResponse), text-curie-001 gave me this answer:
class CreateTaskResponse(object):
    task_id: str
    def
Similarly, I couldn’t get any of the open source code generation models (CodeParrot, Salesforce’s Codegen) to work with my prompts. Those models are clearly very powerful, so maybe my prompts are bad (or maybe those models are just designed around different kinds of tasks).
If you’re getting unsatisfying results, it’s worth trying out different models.
5. “Prompt engineering” is a piece of the puzzle.
The exact structure of the prompts behind autobot had a significant impact on suggestion quality. This process (of designing or engineering a prompt to achieve some desired model behavior) is talked about a lot on Twitter, so I was primed to expect it. But yes, it’s a necessary step in building these kinds of applications.
As an example, autobot can sort class attributes alphabetically. The original prompt behind this omitted the word “alphabetically”: it was something like “…with sorted class attributes”. Tweaking that to “…with alphabetically sorted class attributes” was the difference between useless and useful output.
(I don’t know whether prompt engineering will be important in the long run, but we’re not living in the long run…)
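For a rough sense of what such a prompt might look like (this is a guess at the shape, not autobot’s actual prompt; the argument names mirror make_prompt from the test above), a description like “with alphabetically sorted class attributes” slots into a few-shot template along these lines:
# Hypothetical prompt template, loosely modeled on make_prompt's arguments.
# Not autobot's actual prompt.
TEMPLATE = """\
# A class definition {before_description}:
{before_snippet}

# The same class definition, {after_description}:
{after_snippet}

# A class definition {before_description}:
{snippet}

# The same class definition, {after_description}:
"""


def sketch_prompt(snippet, before_snippet, after_snippet,
                  before_description, after_description):
    return TEMPLATE.format(
        snippet=snippet.strip(),
        before_snippet=before_snippet.strip(),
        after_snippet=after_snippet.strip(),
        before_description=before_description,
        after_description=after_description,
    )
In a template like this, swapping “with sorted class attributes” for “with alphabetically sorted class attributes” as the after_description is exactly the kind of one-word change that flipped the output from useless to useful.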
Limitations.
Here are a few of the limitations I ran into:
- There’s a limit on the number of tokens that you can provide in a single inference request (i.e., on the size of the prompt plus the tokens it generates in response to it). The limit varies from model to model. This caused some issues for autobot, since I was typically asking the model to rewrite entire functions, and so the prompt had to contain the function, and the model had to generate a modified version. Since the ‘function’ is the atomic unit, autobot only works on functions below a certain character count. (I’ll have to define a different, more granular atomic unit to resolve this.)
- The Huggingface Accelerated Inference API provides caching out of the box, but OpenAI does not, so I ended up building a simple caching layer for repeated queries. (autobot also deduplicates repeated queries on the client, such that if you run it against a codebase and it finds repeated snippets, it’ll only hit the OpenAI API once.)
- autobot is typically asking the LLM to rewrite a function or class definition. So I’m giving the model a function or class definition, and looking to get a modified version in response. Getting the model to generate only that modified version and no additional code was challenging — typically, the model would just “keep going” and generate comments, other functions, etc., up to a certain total token count. I eventually solved this problem by adding a “Stop generating code” pragma to the prompt, asking the model to repeat it at the end, and then leveraging the stop sequence parameter in the OpenAI API (see the sketch below).
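To make that last trick concrete, here’s a minimal sketch (the pragma text and prompt shape are hypothetical, not autobot’s actual prompt; the real mechanism is the stop parameter in openai.Completion.create, which truncates the completion as soon as the model emits the sentinel):
import os

import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# Hypothetical sentinel: the worked example in the prompt ends with it, so the
# model learns to emit it once the rewritten function is complete, and the
# `stop` parameter cuts generation off at that point.
STOP_PRAGMA = "# STOP GENERATING CODE"

prompt = f"""\
# Remove print statements from the function below, then write "{STOP_PRAGMA}".

def add(x, y):
    print("adding")
    return x + y

def add(x, y):
    return x + y
{STOP_PRAGMA}

# Remove print statements from the function below, then write "{STOP_PRAGMA}".

def greet(name):
    print("greeting")
    return "Hello, " + name

"""

completion = openai.Completion.create(
    model="text-davinci-002",
    prompt=prompt,
    temperature=0,
    max_tokens=256,
    stop=[STOP_PRAGMA],  # generation halts before the sentinel would be emitted
)
print(completion["choices"][0]["text"])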
Published on September 13, 2022.