Data-First Development with gurobipy-pandas: Speed, Best Practices and Other Considerations

Monday, September 9, 2024

In a Gurobi webinar (view the recording here) on Data-First Optimization Development, Irv discussed the transformative benefits of pandas and best practices for using the gurobipy-pandas library, and walked through an example. The following is a lightly edited excerpt from Irv’s presentation and the Q&A with practitioners from around the world.

In emphasizing Data-First optimization development, I have drawn on the Job Task Analysis (JTA) created by INFORMS, which serves as the basis of the Certified Analytics Professional exam. The JTA also serves as an outline for a solution development process that prescribes preparing and working with the data before creating a model; unfortunately, this is not the way optimization is typically taught in the Operations Research community, which focuses on models. Based on the JTA and the methodology at Princeton Consultants, which has proven to deliver high-quality optimization applications, I strongly recommend you start with the data.

I discovered pandas (https://pandas.pydata.org/) about nine years ago and realized that it was going to solve a lot of problems for me in developing optimization applications. After making a number of contributions to pandas, I was invited to join the core team (https://pandas.pydata.org/about/team.html), and I continue to participate in its ongoing improvements.

Inspired in part by some of our work tying pandas and Gurobi together, Gurobi built the gurobipy-pandas library (https://www.gurobi.com/features/gurobipy-pandas/), which takes advantage of pandas to manage data, allowing Gurobi objects to be placed into pandas DataFrames and Series. The library makes it possible to write model-building code that executes quickly. A good working knowledge of pandas is required to map mathematical formulations into gurobipy-pandas.
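To make that concrete, here is a minimal sketch of the pattern; the table, column names, and numbers are hypothetical rather than taken from the webinar example:

```python
import pandas as pd
import gurobipy as gp
from gurobipy import GRB
import gurobipy_pandas as gppd

# Hypothetical input table: one row per production lot
lots = pd.DataFrame(
    {
        "lot": ["L1", "L2", "L3"],
        "capacity": [100.0, 80.0, 120.0],
        "cost": [3.0, 2.5, 4.0],
    }
).set_index("lot")

model = gp.Model("sketch")

# One continuous variable per row, returned as a pandas Series of Gurobi Var objects
x = gppd.add_vars(model, lots, name="x", obj=lots["cost"])

# One constraint per row, built elementwise between two aligned pandas Series
gppd.add_constrs(model, x, GRB.LESS_EQUAL, lots["capacity"], name="cap")

# A single aggregate constraint: sum the Var Series into a linear expression
model.addConstr(gp.quicksum(x) >= 150, name="total_demand")

model.optimize()
print(x.gppd.X)  # solution values come back as a pandas Series aligned to the lot index
```

The key point is that `x` is an ordinary pandas Series whose values happen to be Gurobi Var objects, so the usual pandas alignment, slicing, and aggregation machinery applies to it directly.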

We have written about this topic in previous blog posts.

Q&A

Question: Have you compared the performance of gurobipy-pandas against the more traditional interface or alternatives? Can you note the performance in terms of both model-building as well as how it might affect the solution time?

Irv: We were internally using pandas to create models before gurobipy-pandas existed. Back in 2018, we had larger-scale models we’d written with gurobipy, both with and without pandas, and we were seeing faster model-building times using pandas. One of the real issues that comes into play has to do with slicing: pandas slices more naturally than gurobipy does because it has very efficient groupby operations, which I think are better than what’s under the hood in gurobipy.
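As a rough sketch of what that slicing looks like in practice (the index, capacities, and names here are made up for illustration), you can group a Series of Gurobi variables by an index level and turn each group into a constraint:

```python
import pandas as pd
import gurobipy as gp
from gurobipy import GRB
import gurobipy_pandas as gppd

# Hypothetical production variables indexed by (plant, product),
# with a separate per-plant capacity limit
idx = pd.MultiIndex.from_product(
    [["A", "B"], ["p1", "p2", "p3"]], names=["plant", "product"]
)
plant_capacity = pd.Series(
    [50.0, 70.0], index=pd.Index(["A", "B"], name="plant"), name="capacity"
)

model = gp.Model("groupby-sketch")
x = gppd.add_vars(model, idx, name="x")

# pandas does the slicing: group the Var Series by the "plant" index level,
# collapse each group into a linear expression, and align it against capacity
plant_load = x.groupby("plant").agg(gp.quicksum)
gppd.add_constrs(model, plant_load, GRB.LESS_EQUAL, plant_capacity, name="plant_cap")
```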

I don't know how that would compare today. In fact, one of the biggest pieces of overhead at the time was the naming of the constraints, which was taking the longest amount of time; the actual creation of the model was flying by in both cases. It is useful to name your constraints and variables because it makes debugging your models a lot easier, but generating those names carries an overhead that can sometimes be the most expensive part of building a model.
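A quick way to see this on your own models is to build the same variables with and without explicit names and compare the build times. The sketch below is illustrative only; the index size is made up, and actual timings will depend on your machine, license, and model:

```python
import time
import pandas as pd
import gurobipy as gp
import gurobipy_pandas as gppd

# Hypothetical large index just to make the comparison visible
idx = pd.RangeIndex(1_000_000, name="i")

with gp.Env() as env:
    m1 = gp.Model(env=env)
    t0 = time.perf_counter()
    gppd.add_vars(m1, idx)              # no names: fastest to build, harder to debug
    m1.update()
    t_unnamed = time.perf_counter() - t0

    m2 = gp.Model(env=env)
    t0 = time.perf_counter()
    gppd.add_vars(m2, idx, name="x")    # named: each Var gets a formatted name like "x[i]"
    m2.update()
    t_named = time.perf_counter() - t0

print(f"unnamed: {t_unnamed:.2f}s, named: {t_named:.2f}s")
```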

Question: Are there any advantages to this method versus maybe using Pyomo with Gurobi?

Irv: In my experience, using gurobipy instead of Pyomo results in better performance. We have evaluated Pyomo and believe its design was based on AMPL and a model-first way of thinking, which we avoid. In my earlier example, we inferred the sets of nodes and commodities from the tables, but in Pyomo you have to explicitly say what those sets are and read them in separately. A real advantage of working with gurobipy-pandas is that you are thinking about and understanding your data. As you develop a model in a notebook, you can look at the data, plot it, generate different descriptive statistics, and see where missing values are. Once you achieve an understanding of the data, it plugs directly into the model you write. In the Pyomo approach, you are not really doing that data analysis and hooking it in. And when it comes to slicing and groupby operations, the gurobipy-pandas approach is going to be a lot faster than using Pyomo.
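For example, inferring the index sets from a table might look like the following sketch, where the arc table and its column names are hypothetical:

```python
import pandas as pd

# Hypothetical arc table for a multicommodity flow problem:
# one row per (commodity, from_node, to_node) with a unit cost
arcs = pd.DataFrame(
    {
        "commodity": ["c1", "c1", "c2"],
        "from_node": ["A", "B", "A"],
        "to_node": ["B", "C", "C"],
        "cost": [1.0, 2.0, 1.5],
    }
)

# The index sets fall out of the data rather than being declared separately
commodities = arcs["commodity"].unique()
nodes = pd.unique(arcs[["from_node", "to_node"]].to_numpy().ravel())

print(sorted(commodities))  # ['c1', 'c2']
print(sorted(nodes))        # ['A', 'B', 'C']
```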

Question: When dealing with large-scale problems, do you find it more effective to define variables within a single DataFrame, perhaps allocating a column for each variable, or should you opt for a DataFrame dedicated to each variable?

Irv: It very much depends. In my simple example earlier with one set of decision variables, I used a Series. In the commercial models that we create, there is typically a mix of variables that have different index sets. If you consider the classical facility location problem, in which you are deciding which facilities to open, you will have a variable that is indexed just on the facilities you might open, and then you might have other variables indexed on things like supply and demand, and then connections in a network. In that case, you are going to place them in separate Series or DataFrames. In the larger models we have developed at Princeton, we typically keep each set of decision variables in its own pandas Series, which allows us to keep things straight from a development perspective: when we are sharing code and collaborating, we all know that is our best practice. We can see that all the variables are in a collection of different Series, named however we want to name them; that is typically what we do here at Princeton.
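As a sketch of that convention (the facility, lane, and cost data below are invented for illustration), each family of decision variables lives in its own Series on its own index set:

```python
import pandas as pd
import gurobipy as gp
from gurobipy import GRB
import gurobipy_pandas as gppd

# Hypothetical data: candidate facilities, and (facility, customer) shipping lanes
facilities = pd.DataFrame(
    {"facility": ["F1", "F2"], "fixed_cost": [100.0, 120.0], "capacity": [60.0, 80.0]}
).set_index("facility")
lanes = pd.DataFrame(
    {
        "facility": ["F1", "F1", "F2", "F2"],
        "customer": ["C1", "C2", "C1", "C2"],
        "ship_cost": [2.0, 3.0, 2.5, 1.0],
    }
).set_index(["facility", "customer"])

model = gp.Model("facility-sketch")

# One Series per family of decision variables, each on its own index set
open_fac = gppd.add_vars(model, facilities, vtype=GRB.BINARY,
                         obj=facilities["fixed_cost"], name="open")
ship = gppd.add_vars(model, lanes, obj=lanes["ship_cost"], name="ship")

# Linking the families: shipments out of a facility fit its capacity only if it is open
shipped = ship.groupby("facility").agg(gp.quicksum)
gppd.add_constrs(model, shipped, GRB.LESS_EQUAL,
                 facilities["capacity"] * open_fac, name="capacity")
```

Keeping `open_fac` and `ship` as separate named Series, rather than forcing them into one DataFrame, keeps each index set explicit and makes the linking constraints read much like the written formulation.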

To discuss this topic with Irv, email us to set up a call.