Over a year ago, I discovered a wonderful article by Hadley Wickham introducing the concept of "Tidy Data." Here is a complete reference:
Hadley Wickham, "Tidy Data", Journal of Statistical Software, Volume 59(10), August 2014.
Wickham, a major contributor to the open source R package for statistical analysis, was addressing the needs of statisticians in his paper. The paper gives examples in R, and my former IBM colleague, Jean-François Puget, has written a blog post demonstrating how to create tidy data in Python, which is my preferred programming environment along with the pandas library. When solving mathematical optimization problems, it is useful to work with data in a tidy form. However, the definition given by Wickham is not well suited for optimization data, so I have taken the liberty of creating my own definition of Tidy Data for Optimization. These are contrasted in the table below, where the key differences in wording are highlighted.
Wickham Definition for Statistics | Lustig Definition for Optimization |
"Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data: | Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with records, parameters and types. In tidy data: |
1. Each variable forms a column. | 1. Each parameter forms a column. |
2. Each observation forms a row. | 2. Each record forms a row. |
3. Each type of observational unit forms a table." |
3. Each type of data unit sharing an index forms a table. |
4. A MultiIndex of each record implicitly defines multiple sets. |
An interesting aspect of optimization that is different from the statistical world is the need for tables that have indexes. In the world of mathematical programming languages such as AMPL, OPL, GAMS, AIMMS and MPL, the user of those languages is responsible for defining the indexes as sets, and then defining different data structures indexed by those sets. When I receive data from clients, the data is most often output from a database, including its index, so that the index and the resulting data are defined in the data. If the data is in a spreadsheet, the indices are often row names and column names, which may need to be reshaped into a tidy form for use in an optimization model. Working with data in a tidy form when developing an optimization model is useful, because it allows the model to naturally scale as the dimension of different index sets grow.
One of my main issues with the mathematical programming languages is that they require the developer to declare the schema of data, and then use different methods to import that data into the modeling language environment. I've always been frustrated by this, as I'd receive data in CSV files, spreadsheets and databases, each implicitly defining the data schema within the data itself. I would spend lots of effort recoding the data schema that was given in the source data files into modeling language declarations. Now I use python and pandas, which eliminates the need to define a data schema in my modeling environment. pandas makes it natural to start with source data files and let the data definitions define the schema of the data. I will discuss this in future blog entries, where I will show the advantages of using pandas to read in data from various data sources, transform that data in preparation for an optimization model, and to manage the entities of an optimization model.
Here are two examples that illustrate tidy data for optimization. The first is an excerpt from an AMPL model provided with the AMPL distribution. One file, steelT.mod, has the declarations of the data model:
set PROD; # products
param T > 0; # number of weeks
param revenue {PROD,1..T} >= 0;
param market {PROD,1..T} >= 0;
The data file, steelT.dat, has the data in a non-tidy form:
param T := 4;
set PROD := bands coils;
param revenue: 1 2 3 4 := bands 25 26 27 27 coils 30 35 37 39 ; param market: 1 2 3 4 := bands 6000 6000 4000 6500 coils 4000 2500 3500 4200 ;
This data is not in a tidy form. The tidy form of the same data could come from a CSV file, a database or a spreadsheet, and would be formatted as follows:
market | revenue | ||
PROD | T | ||
bands | 1 | 6000 | 25 |
coils | 1 | 4000 | 30 |
bands | 2 | 6000 | 26 |
coils | 2 | 2500 | 35 |
bands | 3 | 4000 | 27 |
coils | 3 | 3500 | 37 |
bands | 4 | 6500 | 27 |
coils | 4 | 4200 | 39 |
Here, the first 2 columns (named "PROD" and "T") define the index for each row. In this particular case, because the index is defined using two values, I borrow the term MultiIndex from pandas to define a multi-dimensional index. The two parameters, "market" and "revenue", define each column. Finally, the index is implicitly defining 3 sets:
1. The unique indices of the table: { ('bands', 1),('coils', 1),('bands', 2),('coils', 2), ('bands', 3),('coils', 3),('bands', 4),('coils', 4) } 2. The set "PROD":{'bands', 'coils'} 3. The set "T": {1, 2, 3, 4}
Each of those sets is useful when creating decision variables for an optimization model.
Our second example comes the IBM CPLEX Optimization Studio distribution, using OPL. Part of the model file knapsack.mod reads:
int NbItems = ...;
int NbResources = ...;
range Items = 1..NbItems;
range Resources = 1..NbResources;
int Use[Resources][Items] = ...;
The corresponding data in the data file knapsack.dat reads:
NbResources = 7;
NbItems = 12;
Use = [
[ 19, 1, 10, 1, 1, 14, 152, 11, 1, 1, 1, 1 ],
[ 0, 4, 53, 0, 0, 80, 0, 4, 5, 0, 0, 0 ],
[ 4, 660, 3, 0, 30, 0, 3, 0, 4, 90, 0, 0],
[ 7, 0, 18, 6, 770, 330, 7, 0, 0, 6, 0, 0],
[ 0, 20, 0, 4, 52, 3, 0, 0, 0, 5, 4, 0],
[ 0, 0, 40, 70, 4, 63, 0, 0, 60, 0, 4, 0],
[ 0, 32, 0, 0, 0, 5, 0, 3, 0, 660, 0, 9]];
In this example, the data for the matrix Use is sparse, as most of the values are 0. A challenge with this representation is that as the number of resources and items grows, the size of data in this representation grows much faster than the number of nonzeros in the Use matrix. The tidy representation takes care of that. Here is an excerpt from that representation:
Resources | Items | Use |
R1 | I1 | 19 |
I10 | 1 | |
I11 | 1 | |
I12 | 1 | |
I2 | 1 | |
I3 | 10 | |
I4 | 1 | |
I5 | 1 | |
I6 | 14 | |
I7 | 152 | |
... | ... | ... |
R6 | I3 | 40 |
I4 | 70 | |
I5 | 4 | |
I6 | 63 | |
I9 | 60 | |
R7 | I10 | 660 |
I12 | 9 | |
I2 | 32 | |
I6 | 5 | |
I8 | 3 |
Here, the records are grouped by the names of the resources, and only the nonzero items are specified.
As a preview of some future blog entries, let's assume the data for the AMPL model was given in a tidy form in a CSV file named 01.tidysteel.csv.
PROD, T, market, revenue bands, 1, 6000, 25 coils, 1, 4000, 30 bands, 2, 6000, 26 coils, 2, 2500, 35 bands, 3, 4000, 27 coils, 3, 3500, 37 bands, 4, 6500, 27 coils, 4, 42000, 39
You can download a Jupyter notebook here:
https://github.com/PrincetonConsultants/pandas-Optimization-Tutorial/blob/master/01.tidysteel.ipynb
containing Python and pandas code to read in that file,
and the data file here:
https://github.com/PrincetonConsultants/pandas-Optimization-Tutorial/blob/master/01.tidysteel.csv
Feel free to email me with questions or comments.