Princeton’s Rapid Optimization Model Development Approach with python and pandas

Friday, August 9, 2019

At Princeton Consultants, we have started to do the majority of our optimization model development using python with the pandas library (https://pandas.pydata.org/). We have successfully developed models using the Gurobi (https://www.gurobi.com/) python interface, and the docplex (http://ibmdecisionoptimization.github.io/docplex-doc/) python interface for IBM ILOG CPLEX Optimization Studio (https://www.ibm.com/products/ilog-cplex-optimization-studio). Based on these successes, we have codified our approach for rapid optimization model development. Our steps for success include:

  1. Understand the data that will be required to do successful optimization. In this step, we work with our clients to make sure that we understand what data is available, the semantic meaning of that data, how they organize that data, and how different pieces of data are related to each other. This is the most important step in our optimization projects, because without a deep understanding of the data available for optimization, we cannot build a good model.
  2. Define and document datasets format that will be needed for optimization consisting of a set of tables in a tidy data. The concept of tidy data for optimization is something that I wrote about in a past blog at https://www.princetonoptimization.com/blog/blog/tidy-data-optimization-1. When we create these definitions, we define mathematical notation to represent the data items. We use Markdown documents where we can include LaTeX notation, using either Visual Studio Code or Jupyter notebooks as a development environment. The mathematical notation includes defining sets and data elements indexed by those sets. For defining sets, we prefer using LaTeX \mathcal prefixing upper case letters so they look something like this: \(\mathcal{R}\). We use the corresponding lower case letters to define individual indices in a set, and lower case greek letters to define data associated with indices of those sets. For example, an index set labeled as \(\mathcal{R}\) might have a data vector \(\alpha\) for which the individual elements of the vector are written as \(\alpha_r\).
  3. Our next step is to collect data from the client in their format (labeling it the “raw” data format) and write python and pandas code to transform that data into the tidy data format defined in step 2. This step provides a form of data auditing to make sure that the data we need for optimization is truly available. For example, if we are writing a model that uses a time index \(t\), and the raw data is provided as calendar dates and times, we need to transform the raw data into time indices via some agreed upon methodology for converting absolute times into time buckets. As we work on this code, we sometimes need to update our documentation of the tidy data format that we are expecting for optimization. We call this the “model data” format.
  4. With a better understanding of the data that can be collected for optimization, we then move on to documenting data transformations of the tidy data that might be needed as input to the optimization model. For example, if we have two input tables indexed on the same set \(\mathcal{R}\) with each element denoted as \(\alpha_r\) and \(\beta_r\), and we need to compute the sum of the minimum of the two quantities denoted as \(\gamma\), we would write in our documentation:

\[ \gamma=\sum_{r\in\mathcal{R}} m i n{\{}\alpha_r,\beta_r\} \]

  1. With all the data inputs and data transformations needed for optimization now documented, we then write the documentation of our optimization model, including decision variables, an objective function and constraints. Each decision variable and constraint will be indexed on a collection of sets defined in our data documentation, and, for decision variables, we use lower case letters to document each one. Along with each definition of a decision variable and constraints, we write a verbal description of their purpose.
  2. Now that our data inputs, data transformation and model have been documented, we are now ready to write our optimization code. We use python and pandas to read in the data and do any of the necessary transformations. Each table consisting of tidy data can be read in using one line of python code with pandas. We place each set of decision variables in a pandas Series indexed on an appropriate set. We then implement each set of constraints one at a time, and after writing a set of constraints, we study the LP file representation of that constraint as well as optimize the problem to make sure the problem remains feasible as we add each set of constraints. After all constraints are implemented, we then add the objective function, and can then start analyzing the results of our model to see if it is obtaining our expected results. The advantage of adding the objective function last is that with all constraints implemented, we know that our model is feasible.
  3. With respect to constraints, we use the power of the pandas groupby() operator, pd.concat() function and the DataFrame itertuples() method to write constraints. This is best illustrated by an example:

Let’s say that the input data is \(\alpha_i\), represented as a pandas table alpha with index [IVals], indexed over a set \(\mathcal I\). Suppose that there are decision variables \(x_{ij}\) and \(y_{ik}\) represented as a pandas Series with indices [IVals,JVals] and [IVals,KVals] respectively and that we want to write a constraint such as: \[ \sum_{j} x_{ij}+\sum_{k} y_{ik}=\alpha_i\qquad\forall i\in\mathcal I \] With pandas, we use groupby() on the constraint index (in this case, corresponding to the index IVals) to do sums with pd.concat() to line up the different components of the constraint as follows (using Gurobi’s grb.quicksum() as an example):

Putting that together with the itertuples() method allows us to iterate through the resulting pandas data structure to get the components of the constraint (using Gurobi as an example):

We end up repeating this design pattern throughout optimization code to state constraints. It also serves as a useful paradigm for finding places where different parts of a constraint are not lined up correctly, with either missing variables or missing data. Finally, by using the pandas MultiIndex paradigm to index sparse sets, we efficiently work with sparse data, sparse decision variable constructs and sparse constraint constructs.

In conclusion, we have found that by following these steps, it allows us to quickly turn raw data and business requirements into optimization models that bring real business value to our clients.