Tidy data
Dorota Celińska-Kopczyńska
University of Warsaw
16 October 2018
The article
Wickham, Hadley. 2014. Tidy Data. Journal of Statistical Software, vol. 59.
Motivation
I A huge amount of effort is spent cleaning data to get it ready for analysis.
I However, little research has been devoted to how to do it efficiently!
I Every messy data set is messy in its own way; tidy data sets all follow the same principles.
Tidy data
I Codd’s third normal form, reworded in statistical language.
I We will focus on a single data set rather than many connected data sets in a relational database.
Principles of tidy data
I Each variable forms a column.
I Each observation forms a row.
I Each type of observational unit forms a table.
Five most common problems
I Column headers are values, not variable names.
I Multiple variables are stored in one column.
I Variables are stored in both rows and columns.
I Multiple types of observational units are stored in the same table.
I A single observational unit is stored in multiple tables.
Column headers as values instead of variable names
I Usually found in tabular data designed for presentation.
I Sometimes this layout is useful, especially when we perform matrix operations.
Raw data, melting and molten data
I The data set in Table 4 of the article has three variables: religion, income and frequency.
I Tidying it requires melting – we need to turn columns into rows.
I You may know this procedure under the name of making wide data sets long.
I We introduce two new variables: one that contains the former column headings and one that contains the concatenated data values.
I The result is a molten data set.
Raw data, melting and molten data – toy example
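A toy version of melting, sketched in plain Python rather than with the R tools named elsewhere in these slides. The religion labels, income brackets and counts are invented for illustration and are not the values from the article; `melt` is a hypothetical helper written for this sketch.

```python
# Wide ("messy") layout: the income brackets are column headers, i.e. values.
# All labels and counts are made up for this sketch.
wide = [
    {"religion": "A", "<$10k": 5, "$10-20k": 8},
    {"religion": "B", "<$10k": 2, "$10-20k": 7},
]


def melt(rows, id_vars, var_name, value_name):
    """Turn every non-identifier column into one (variable, value) row."""
    molten = []
    for row in rows:
        ids = {key: row[key] for key in id_vars}
        for column, value in row.items():
            if column not in id_vars:
                molten.append({**ids, var_name: column, value_name: value})
    return molten


molten = melt(wide, id_vars=["religion"], var_name="income", value_name="frequency")
for row in molten:
    print(row)
```

Each wide row with two income columns becomes two molten rows, so the result has one row per religion and income bracket.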
Multiple variables stored in one column
I Melting may lead to having multiple variable names stored in one column.
I Tidying usually requires heuristics (from simple splitting on a string to regular expressions).
I Once the variables are separated, their values are easier to work with (there are fewer combinations to handle).
Multiple variables stored in one column – example
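A minimal sketch of the splitting heuristic, in plain Python with a regular expression. It assumes a hypothetical coding scheme in which a code such as `m014` means sex `m` and age range 0-14; the country, codes and counts are invented.

```python
import re

# Molten rows whose "column" value packs two variables: sex and age range.
# The codes and counts are invented for this sketch.
molten = [
    {"country": "X", "column": "m014", "cases": 3},
    {"country": "X", "column": "f1524", "cases": 4},
]

# One letter for sex, then the lower age bound ("0" or two digits)
# and a two-digit upper age bound.
PATTERN = re.compile(r"^(?P<sex>[mf])(?P<low>0|\d{2})(?P<high>\d{2})$")


def split_column(row):
    """Replace the combined 'column' field with separate 'sex' and 'age' fields."""
    match = PATTERN.match(row["column"])
    if match is None:
        raise ValueError(f"unrecognised code: {row['column']!r}")
    tidy_row = {key: value for key, value in row.items() if key != "column"}
    tidy_row["sex"] = match["sex"]
    tidy_row["age"] = f"{int(match['low'])}-{int(match['high'])}"
    return tidy_row


tidy = [split_column(row) for row in molten]
for row in tidy:
    print(row)
```

Codes that do not fit the assumed pattern raise an error instead of being split silently, which suits a heuristic that may not cover every header.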
Variables stored in both rows and columns
I The most complicated form of messy data.
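One way to tidy such a table is to melt the columns that hold values and then turn the column that holds variable names back into columns. The sketch below does this in plain Python on an invented table in which days are spread across the columns `d1` to `d3` and the `element` column holds the variable names `tmin` and `tmax`; the station name and all measurements are made up.

```python
# Messy layout: days are column headers and "element" holds variable names.
# None marks a missing measurement. All values are invented.
messy = [
    {"station": "S1", "element": "tmax", "d1": 28.0, "d2": 30.5, "d3": None},
    {"station": "S1", "element": "tmin", "d1": 14.5, "d2": 16.0, "d3": None},
]
day_columns = ["d1", "d2", "d3"]

# Step 1: melt the day columns into rows, skipping missing measurements.
molten = [
    {
        "station": row["station"],
        "day": int(col[1:]),
        "element": row["element"],
        "value": row[col],
    }
    for row in messy
    for col in day_columns
    if row[col] is not None
]

# Step 2: turn the "element" values into columns, one row per station and day.
tidy = {}
for row in molten:
    key = (row["station"], row["day"])
    record = tidy.setdefault(key, {"station": row["station"], "day": row["day"]})
    record[row["element"]] = row["value"]

tidy_rows = [tidy[key] for key in sorted(tidy)]
for row in tidy_rows:
    print(row)
```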
Multiple types in one table
I Data sets often involve values collected at multiple levels, on different types of observational units.
I Solving this problem is directly related to normalization and the relational model of data.
I There are few data analysis tools that work directly with relational data!
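A minimal sketch of the normalization step in plain Python, assuming a table that mixes two observational units: a song (artist, track) and its rank in a given week. The names, the `song_id` key and all values are invented for illustration.

```python
# One table mixing two observational units; song facts repeat on every row.
# All values are invented for this sketch.
combined = [
    {"artist": "A", "track": "T1", "week": 1, "rank": 10},
    {"artist": "A", "track": "T1", "week": 2, "rank": 7},
    {"artist": "B", "track": "T2", "week": 1, "rank": 3},
]

songs = []     # one row per song
ranks = []     # one row per song and week
song_ids = {}  # (artist, track) -> generated song id

for row in combined:
    key = (row["artist"], row["track"])
    if key not in song_ids:
        # First time we see this song: give it an id and store its facts once.
        song_ids[key] = len(song_ids) + 1
        songs.append({"song_id": song_ids[key], "artist": row["artist"], "track": row["track"]})
    ranks.append({"song_id": song_ids[key], "week": row["week"], "rank": row["rank"]})

print(songs)
print(ranks)
```

Because few analysis tools work with relational data directly, the two tables would usually be merged again on `song_id` before analysis.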
One type in multiple tables
I This occurs when a single type of observational unit is spread over multiple tables or files.
I Read the files into a list of tables.
I For each table, add a new column recording the original file name – the file name is often the value of an important variable.
I Combine all tables into a single table.
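The three steps above can be sketched in plain Python. To keep the example self-contained, in-memory CSV text stands in for files on disk; the file names, column names, values and the `source_file` column name are all invented.

```python
import csv
import io

# Stand-ins for files on disk: hypothetical file names mapped to CSV text.
files = {
    "2017.csv": "name,count\nAnna,12\nJan,9\n",
    "2018.csv": "name,count\nAnna,15\n",
}

# Step 1: read the files into a list of tables (each table is a list of dict rows).
tables = [
    (filename, list(csv.DictReader(io.StringIO(text))))
    for filename, text in files.items()
]

# Step 2: add a column recording the original file name.
for filename, table in tables:
    for row in table:
        row["source_file"] = filename

# Step 3: combine all tables into a single table.
combined = [row for _, table in tables for row in table]
for row in combined:
    print(row)
```

Note that `csv.DictReader` returns every value as a string, so `count` would still need converting to a number before analysis.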
Manipulation
I Filter – subsetting or removing observations based on some condition.
I Transform – adding or modifying variables.
I Aggregation – collapsing multiple values into a single value (for example by summing or taking the mean).
I Sort – changing the order of observations.
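The four operations, sketched in plain Python on a small tidy table. The cities, years and sales figures are invented, and `above_target` is a hypothetical derived variable.

```python
from collections import defaultdict

# A small tidy table: one row per observation, one key per variable.
# All values are invented for this sketch.
rows = [
    {"city": "Warsaw", "year": 2017, "sales": 10},
    {"city": "Warsaw", "year": 2018, "sales": 14},
    {"city": "Krakow", "year": 2017, "sales": 6},
    {"city": "Krakow", "year": 2018, "sales": 9},
]

# Filter: keep the observations that satisfy a condition.
recent = [row for row in rows if row["year"] == 2018]

# Transform: add a variable computed from existing ones.
flagged = [{**row, "above_target": row["sales"] >= 10} for row in rows]

# Aggregate: collapse multiple values into a single value per group.
totals = defaultdict(int)
for row in rows:
    totals[row["city"]] += row["sales"]

# Sort: change the order of observations.
by_sales = sorted(rows, key=lambda row: row["sales"], reverse=True)

print(recent)
print(dict(totals))
print(by_sales[0])
```

Every operation takes a tidy table and refers to variables by name, which is what makes the operations easy to chain.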
Manipulation – tools
I Filter – base::subset()
I Transform – base::transform()
I Aggregation – plyr::summarise()
I Sort – plyr::arrange()
I Working on subsets – base::by(), plyr::ddply()
I Combining data sets – base::merge(), plyr::join()
Visualization
I Visualization tools only need to be input-tidy – their output is visual.
I Visualization is a mapping between variables and the aesthetic properties of a graph (such as position, size, shape and colour).
I There are also tools for visualizing messy data.
Visualization – tools
I Tidy input: graphics::plot(), lattice, ggplot2.
I Messy input: graphics::barplot(), graphics::matplot(), graphics::mosaicplot()...
Modeling
I Tidy data are similar to the internal data model used in regression analysis.
I Depending on the data structure, the same problem may be solved using different techniques (for example, a paired t-test on wide data or a mixed-effects model on tidy data).
Homework
I Read and think about Section 5 – Case study.
I Do you find tidy data beneficial for your daily work? How?
I Think about the limitations of, and obstacles to, using tidy data.
I Think about data sets that you have used – are they tidy? If not, what should be done to make them tidy?
Discussion
I Tidy data may not be the most efficient way to store data.
I For multidimensional analysis, the definition of tidy data may need to be revised.
I Restructuring is only a part of the problem – how can other areas of data cleaning be improved?