Microsoft announced Jigsaw, a new tool that can improve the performance of large language models. “Large pre-trained language models (e.g., GPT-3, Codex) can be tuned to generate code from the natural language specifications intended by the programmer. Such automated models have the potential to improve the productivity of every programmer in the world; however, the quality of the generated code cannot be guaranteed, because these models may have difficulty understanding program semantics.”

According to the introduction, Jigsaw deploys post-processing techniques that understand the syntax and semantics of programs, then uses user feedback to improve future performance; the tool is designed to synthesize code for the Python Pandas API from multi-modal input. Pandas is a widely used API in data science, with hundreds of functions for manipulating dataframes, i.e., tables with rows and columns.
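To give a sense of the kind of dataframe manipulation Pandas is used for, here is a generic example (not taken from the paper):

```python
import pandas as pd

# A small dataframe of sales records
df = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "sales": [100, 150, 200, 50],
})

# A typical Pandas task: total sales per region, sorted descending
totals = df.groupby("region")["sales"].sum().sort_values(ascending=False)
print(totals.to_dict())  # {'north': 300, 'south': 200}
```

Jigsaw targets exactly this kind of transformation: given an intent expressed in English plus example input and output, it tries to synthesize the corresponding Pandas call chain.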

For its part, Microsoft says its experience shows that Jigsaw can play an important role in improving system accuracy as these large language models evolve to synthesize code based on intent.


Large language models like OpenAI’s Codex are redefining the field of programming. Software developers can provide English descriptions of expected code snippets when solving programming tasks, and Codex can synthesize the expected code in languages like Python or JavaScript. The synthesized code still has to be reviewed before it is used; with Project Jigsaw, the goal is to automate some of these reviews to increase the productivity of developers who use large language models such as Codex for code synthesis, the Jigsaw team explains.

Microsoft believes Jigsaw can “fully automate” the entire process of checking that code compiles, handling error messages, and testing the code to see whether it produces the output the developer wants. “Jigsaw takes as input an English description of the expected code along with I/O examples, that is, inputs paired with their associated outputs, and provides the quality assurance that the output Python code will compile on the supplied input and produce the expected output.”
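One way to picture this compile-and-check step is the following minimal sketch; the function and variable names are illustrative, not Jigsaw’s actual API:

```python
import pandas as pd

def passes_io_check(candidate_code: str, df_in: pd.DataFrame,
                    df_expected: pd.DataFrame) -> bool:
    """Return True if the candidate snippet compiles and, when run on the
    input dataframe, produces the expected output dataframe."""
    try:
        compiled = compile(candidate_code, "<candidate>", "exec")  # syntax check
    except SyntaxError:
        return False
    env = {"pd": pd, "df": df_in.copy()}
    try:
        exec(compiled, env)  # the snippet is expected to define `out`
    except Exception:
        return False
    out = env.get("out")
    return isinstance(out, pd.DataFrame) and out.equals(df_expected)

# Example I/O pair: drop rows containing missing values
df_in = pd.DataFrame({"a": [1.0, None, 3.0]})
df_out = pd.DataFrame({"a": [1.0, 3.0]})
print(passes_io_check("out = df.dropna().reset_index(drop=True)", df_in, df_out))  # True
```

A candidate that fails to compile, raises at runtime, or returns the wrong dataframe is rejected by the same check, which is what makes the I/O example a machine-checkable specification.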

In its ICSE 2022 paper “Jigsaw: Large Language Models meet Program Synthesis,” Microsoft evaluated this approach on Python Pandas. Using Jigsaw, users can provide an English description of the intended transformation, an input dataframe, and a corresponding output dataframe, and then have Jigsaw synthesize the intended code.
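Such a multi-modal specification might look like the following (an illustrative example, not one from the paper):

```python
import pandas as pd

# Natural-language intent the user would give Jigsaw
query = "Keep only rows where price is above 10, sorted by price"

# The accompanying input/output example pair
df_in = pd.DataFrame({"item": ["pen", "book", "lamp"], "price": [3, 12, 25]})
df_expected = pd.DataFrame({"item": ["book", "lamp"], "price": [12, 25]})

# The code Jigsaw would aim to synthesize for this specification
df_out = df_in[df_in["price"] > 10].sort_values("price").reset_index(drop=True)
print(df_out.equals(df_expected))  # True
```

The English query captures the intent, while the dataframe pair gives Jigsaw something concrete to validate candidate code against.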

Jigsaw takes the English query and preprocesses it with the appropriate context to construct an input that can be fed to a large language model. In its experiments, Microsoft found that Jigsaw produced the correct output about 30% of the time. If the code fails, the repair process begins in the post-processing phase.
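The overall loop could be sketched roughly as follows; this is a hypothetical outline based on the description above, and none of the names come from Jigsaw itself:

```python
import pandas as pd

def runs_correctly(code, df_in, df_expected):
    """Execute a candidate snippet on the input and compare to the expected output."""
    env = {"pd": pd, "df": df_in.copy()}
    try:
        exec(code, env)  # the snippet is expected to define `out`
    except Exception:
        return False
    out = env.get("out")
    return isinstance(out, pd.DataFrame) and out.equals(df_expected)

def synthesize(query, df_in, df_expected, model, repairs):
    """Query the model once, then attempt repair transformations on failure."""
    candidate = model(query)                 # preprocessed query -> model output
    if runs_correctly(candidate, df_in, df_expected):
        return candidate
    for repair in repairs:                   # post-processing repair attempts
        fixed = repair(candidate)
        if runs_correctly(fixed, df_in, df_expected):
            return fixed
    return None

# Toy demo: a "model" that emits code with a misspelled column name,
# and a repair that substitutes a column actually present in the dataframe.
df_in = pd.DataFrame({"x": [2, 1]})
df_expected = pd.DataFrame({"x": [1, 2]})
model = lambda q: 'out = df.sort_values("y").reset_index(drop=True)'
repairs = [lambda code: code.replace('"y"', '"x"')]
print(synthesize("sort by x", df_in, df_expected, model, repairs))
```

The I/O example serves double duty here: it detects that the model’s first attempt is wrong, and it confirms when a repair has succeeded.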

During post-processing, Jigsaw applies three transformations to fix the code, each motivated by failure modes the team observed in GPT-3 and Codex. Both models fail in similar ways, so Jigsaw’s post-processing for resolving these failure modes proved useful for both.

Microsoft evaluated Codex and Jigsaw (with Codex) on various datasets and measured accuracy: Codex achieved about 30% accuracy out of the box, while Jigsaw improved accuracy to over 60%; with user feedback, accuracy rose to over 80%. Microsoft will continue refining Jigsaw, extending its experience with the Python Pandas API to other APIs and other languages, so that the tool can play a greater role in improving programmer productivity through automation.

For more details, check out the official blog: