IARCS Verification Seminar Series


Title: Program Synthesis for Data Science

Speaker: Rohan Bavishi


Rohan Bavishi is a PhD candidate at UC Berkeley advised by Koushik Sen, and is a member of the Programming Systems group. His research interests lie in developing low-code, synthesis-based solutions that improve the productivity of developers and data scientists. His work has been published at PL and SE venues such as OOPSLA, FSE, and ASE. Rohan obtained his M.S. degree from UC Berkeley in 2019 and B.Eng. from the Indian Institute of Technology (IIT) Kanpur in 2017.


When: Tuesday, 04 January 2022 at 1900 hrs (IST)

Abstract:
The Python ecosystem has emerged as a very popular platform for data science, as it offers a number of powerful tools, in the form of libraries and APIs, for data access, data processing, modeling, visualization, and machine learning. However, these tools present a steep learning curve for data scientists and developers. Their APIs contain hundreds of functions, have dense documentation, and often lack sufficient examples. This presents a huge opportunity to lower the barrier to entry and improve the productivity of data scientists by synthesizing code that uses these APIs from high-level, easy-to-provide specifications of user intent.

In this talk, I will outline three systems (AutoPandas, Gauss, and VizSmith) that we developed for synthesizing code for two core data science operations, namely table transformations and data visualization. Each represents a different decision point along the three main controllable dimensions of synthesis: specification modality (the user intent format), search space, and search algorithm, each informed by the domain at hand. AutoPandas accepts I/O tables from the user and synthesizes Pandas-based table transformation code. It combines the idea of generators from the program testing community with graph neural networks to speed up enumerative search. Gauss addresses the pain points of using input-output examples for table transformations from a user's perspective. It uses a new UI-based interaction mechanism that captures more information than plain I/O tables but requires less effort from the user. It also employs novel graph-based reasoning algorithms that speed up search by 10x on average compared to state-of-the-art systems based on I/O examples. Synthesizing visualizations presents a significantly different challenge: it is difficult to capture user intent precisely as machine-checkable specifications such as I/O examples. VizSmith accepts keyword-oriented natural language queries along with the data columns to visualize, and synthesizes multiple visualizations that users can easily browse and pick from. It employs a novel approach based on code mining and program analysis to crowdsource its search space of visualizations. The three systems have been published at OOPSLA '19, OOPSLA '21, and ASE '21, respectively.
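To make the idea of synthesis from I/O tables concrete, here is a minimal sketch of plain enumerative search. It is not AutoPandas's algorithm: it uses no generators and no neural guidance, and it does not use the Pandas API. Tables are assumed to be non-empty lists of dicts, and the operation names `sort_by` and `drop_column`, as well as the `synthesize` function, are hypothetical and introduced only for illustration.

```python
from itertools import product

# Assumption: a table is a non-empty list of rows, each row a dict
# mapping column name -> value. This stands in for a real dataframe.

def sort_by(col):
    """Hypothetical operation: sort rows by one column (ascending, stable)."""
    return (f"sort_by({col!r})", lambda t: sorted(t, key=lambda r: r[col]))

def drop_column(col):
    """Hypothetical operation: remove one column from every row."""
    return (
        f"drop_column({col!r})",
        lambda t: [{k: v for k, v in r.items() if k != col} for r in t],
    )

def candidate_ops(columns):
    """Instantiate every operation with every column of the input table."""
    for col in columns:
        yield sort_by(col)
        yield drop_column(col)

def synthesize(inp, out, max_depth=2):
    """Enumerate operation sequences of increasing length and return the
    first one that maps `inp` to `out`, or None if none is found."""
    ops = list(candidate_ops(list(inp[0].keys())))
    for depth in range(1, max_depth + 1):
        for seq in product(ops, repeat=depth):
            table = inp
            try:
                for _, fn in seq:
                    table = fn(table)
            except KeyError:
                # The sequence referenced a column it had already dropped.
                continue
            if table == out:
                return [name for name, _ in seq]
    return None

# The user's intent, expressed only as an input table and the desired output.
example_input = [
    {"name": "b", "score": 2, "tmp": 0},
    {"name": "a", "score": 1, "tmp": 0},
]
example_output = [
    {"name": "a", "score": 1},
    {"name": "b", "score": 2},
]

program = synthesize(example_input, example_output)
print(program)
```

Even in this toy setting the number of candidate sequences grows exponentially with `max_depth`, which is the cost that smarter search strategies such as those described above aim to reduce.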

Lastly, I will discuss some of our ongoing work in this space. I will also outline some discussion points on the future of synthesis for such domains, especially in the context of large language models such as Codex and GPT-3, which have demonstrated impressive synthesis capabilities.