Tidy data

Dorota Celińska-Kopczyńska

University of Warsaw

16 October 2018

The article

Wickham, Hadley. 2014. Tidy Data. Journal of Statistical Software, vol. 59.

Motivation

- A huge amount of effort is spent cleaning data to get it ready for analysis.

- However, little research is devoted to how to do it efficiently.

- Every messy data set is messy in its own way, whereas tidy data sets all follow the same principles.

Tidy data

- Tidy data is Codd’s third normal form, restated in statistical language.

- We will focus on a single data set rather than on many connected data sets in a relational database.

Principles of tidy data

- Each variable forms a column.

- Each observation forms a row.

- Each type of observational unit forms a table.

Five most common problems

- Column headers are values, not variable names.

- Multiple variables are stored in one column.

- Variables are stored in both rows and columns.

- Multiple types of observational units are stored in the same table.

- A single observational unit is stored in multiple tables.

Column headers as values instead of variable names

- This usually occurs in tabular data designed for presentation.

- The format can sometimes be useful, especially for matrix operations.

Raw data, melting and molten data

- The data set in Table 4 has three variables: religion, income and frequency.

- Tidying it requires melting: turning columns into rows.

- You may know this procedure as making wide data sets long.

- Melting introduces two new variables: one holding the former column headings and one holding the concatenated data values.

- The result is a molten data set.

Raw data, melting and molten data – toy example
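
A minimal sketch of melting in plain Python, assuming a small wide table with one column per income bracket. The religions, brackets and counts are made up for illustration and are not the values from Table 4; the helper `melt` and its parameter names are hypothetical.

```python
# Hypothetical wide table: one row per religion, one column per income bracket.
# All counts are made up for illustration.
wide = [
    {"religion": "Religion A", "<$10k": 10, "$10-20k": 20},
    {"religion": "Religion B", "<$10k": 30, "$10-20k": 40},
]


def melt(rows, id_vars, var_name, value_name):
    """Turn every non-identifier column of each row into a row of its own."""
    molten = []
    for row in rows:
        for column, value in row.items():
            if column in id_vars:
                continue
            # Keep the identifier columns, then store the old column
            # heading and the old cell value as two new variables.
            record = {key: row[key] for key in id_vars}
            record[var_name] = column
            record[value_name] = value
            molten.append(record)
    return molten


molten = melt(wide, id_vars=["religion"], var_name="income", value_name="freq")
for record in molten:
    print(record)
```

Each wide row with two income columns becomes two long rows, so the molten table has one row per combination of religion and income.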

Multiple variables stored in one column

- After melting, a single column may contain multiple variable names.

- Tidying usually requires heuristics, ranging from simple string splitting to regular expressions.

- Here tidy data sets make it easier to work with variable values (there are fewer combinations).

Multiple variables stored in one column – example
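
A minimal sketch of splitting such a column with a regular expression, assuming molten rows whose `column` field encodes sex (`m` or `f`) followed by an age range, for example `m014` for males aged 0-14. The field names, the encoding and the data are assumptions made for illustration.

```python
import re

# Hypothetical molten rows: "column" mixes two variables, sex and age range.
molten = [
    {"country": "AD", "column": "m014", "cases": 0},
    {"country": "AD", "column": "f1524", "cases": 1},
]

# One letter for sex, then the lower and upper bound of the age range.
PATTERN = re.compile(r"^(?P<sex>[mf])(?P<lo>\d{1,2})(?P<hi>\d{2})$")


def split_column(row):
    """Replace the combined 'column' field with separate sex and age fields."""
    match = PATTERN.match(row["column"])
    if match is None:
        raise ValueError(f"unexpected column name: {row['column']!r}")
    tidy_row = {key: value for key, value in row.items() if key != "column"}
    tidy_row["sex"] = match["sex"]
    tidy_row["age"] = f"{int(match['lo'])}-{int(match['hi'])}"
    return tidy_row


tidy = [split_column(row) for row in molten]
for row in tidy:
    print(row)
```

Raising an error on names that do not match keeps the heuristic from silently producing wrong values when the encoding differs from what was assumed.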

Variables stored in both rows and columns

- This is the most complicated form of messy data.
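
A minimal sketch of this case, assuming a hypothetical weather table in which the day of the month is spread across columns (`d1`, `d2`) while the `element` column holds variable names (`tmax`, `tmin`). Tidying takes two steps: melt the day columns into rows, then turn the values of `element` into columns. All names and numbers below are made up.

```python
# Hypothetical messy table: days are in columns, variable names are in rows.
raw = [
    {"id": "S1", "month": 1, "element": "tmax", "d1": 27.8, "d2": 27.3},
    {"id": "S1", "month": 1, "element": "tmin", "d1": 14.5, "d2": 14.4},
]

ID_VARS = ("id", "month", "element")


def tidy_weather(rows):
    """Melt the day columns, then spread 'element' into separate columns."""
    by_key = {}
    for row in rows:
        for column, value in row.items():
            if column in ID_VARS:
                continue
            day = int(column[1:])  # "d2" -> 2
            key = (row["id"], row["month"], day)
            # One output row per station, month and day.
            record = by_key.setdefault(
                key, {"id": row["id"], "month": row["month"], "day": day}
            )
            # The element name (tmax, tmin) becomes a column of its own.
            record[row["element"]] = value
    return [by_key[key] for key in sorted(by_key)]


tidy = tidy_weather(raw)
for row in tidy:
    print(row)
```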

Multiple types in one table

- Data sets often involve values collected at multiple levels, on different types of observational units.

- Solving this problem is directly related to normalization and the relational model of data.

- Few data analysis tools work directly with relational data!
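
A minimal sketch of the normalization step, assuming a hypothetical table that mixes two observational units: songs (artist, track) and weekly ranks. The table is split into one table per unit, linked by a generated `id`. The data and all field names are made up for illustration.

```python
# Hypothetical table mixing two observational units: songs and weekly ranks.
mixed = [
    {"artist": "A", "track": "X", "week": 1, "rank": 87},
    {"artist": "A", "track": "X", "week": 2, "rank": 82},
    {"artist": "B", "track": "Y", "week": 1, "rank": 91},
]


def normalize(rows):
    """Split the rows into a song table and a rank table linked by an id."""
    song_ids = {}
    songs, ranks = [], []
    for row in rows:
        key = (row["artist"], row["track"])
        if key not in song_ids:
            # First time this song is seen: assign the next id.
            song_ids[key] = len(song_ids) + 1
            songs.append(
                {"id": song_ids[key], "artist": row["artist"], "track": row["track"]}
            )
        ranks.append(
            {"song_id": song_ids[key], "week": row["week"], "rank": row["rank"]}
        )
    return songs, ranks


songs, ranks = normalize(mixed)
print(songs)
print(ranks)
```

Facts about a song are now stored once instead of being repeated in every weekly row. For analysis, the two tables usually have to be merged back together.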

One type in multiple tables

- This occurs when a single type of observational unit is spread over multiple tables or files.

- Read the files into a list of tables.

- For each table, add a new column recording the original file name, which is often the value of an important variable.

- Combine all tables into a single table.
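
The three steps above can be sketched with the Python standard library. The example writes two small CSV files to a temporary folder so that it is self-contained; the file names (`2017.csv`, `2018.csv`), their contents and the `source` column name are assumptions made for illustration.

```python
import csv
import tempfile
from pathlib import Path


def read_table(path):
    """Read one CSV file into a list of row dictionaries."""
    with path.open(newline="") as handle:
        return list(csv.DictReader(handle))


def combine(paths):
    """Read every file, tag each row with its file name, and stack the rows."""
    # Step 1: read the files into a list of tables.
    tables = [(path.name, read_table(path)) for path in paths]
    combined = []
    for name, table in tables:
        for row in table:
            # Step 2: record the original file name in a new column.
            row["source"] = name
            # Step 3: append to the single combined table.
            combined.append(row)
    return combined


with tempfile.TemporaryDirectory() as folder:
    # Hypothetical input files, one per year.
    samples = {
        "2017.csv": "name,count\nAnna,3\n",
        "2018.csv": "name,count\nJan,5\n",
    }
    for name, body in samples.items():
        (Path(folder) / name).write_text(body)
    combined = combine(sorted(Path(folder).glob("*.csv")))

for row in combined:
    print(row)
```

Here the file name carries the year, so after combining it can be parsed into a proper `year` variable.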

Manipulation

- Filter – subsetting or removing observations based on some condition.

- Transform – adding or modifying variables.

- Aggregation – collapsing multiple values into a single value.

- Sort – changing the order of observations.
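
A minimal sketch of the four operations on a tidy table held as a list of dictionaries. It uses plain Python rather than any particular data-analysis library, and the data and column names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical tidy table: one observation per row.
rows = [
    {"group": "a", "value": 3},
    {"group": "b", "value": -1},
    {"group": "a", "value": 5},
    {"group": "b", "value": 4},
]

# Filter: keep only the observations that satisfy a condition.
positive = [row for row in rows if row["value"] > 0]

# Transform: add a new variable computed from an existing one.
transformed = [{**row, "squared": row["value"] ** 2} for row in positive]

# Aggregate: collapse the values within each group into a single value.
totals = defaultdict(int)
for row in transformed:
    totals[row["group"]] += row["value"]
aggregated = [{"group": group, "total": total} for group, total in totals.items()]

# Sort: change the order of the observations.
ordered = sorted(aggregated, key=lambda row: row["total"], reverse=True)

print(ordered)
```

Because every step takes a tidy table and returns a tidy table, the steps can be chained in any order.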

Manipulation – tools

- Filter – base::subset()

- Transform – base::transform()

- Aggregation – plyr::summarise()

- Sort – plyr::arrange()

- Working on subsets – base::by(), plyr::ddply()

- Combining data sets – base::merge(), plyr::join()

Visualization

- Visualization tools only need to be input-tidy – their output is visual.

- Visualization can be viewed as a mapping between variables and the aesthetic properties of a graph.

- Tools for visualizing messy data also exist.

Visualization – tools

- Input tidy: graphics::plot(), lattice, ggplot2.

- Messy input: graphics::barplot(), graphics::matplot(), graphics::mosaicplot()...

Modeling

- Tidy data are similar to the internal data model used in regression analysis.

- Depending on how the data are structured, the same problem may be solved with different techniques.

Homework

- Read and think about Section 5 – Case study.

- Do you find tidy data beneficial for your daily work? How?

- Think about the limitations of, or obstacles to, using tidy data.

- Think about data sets that you have used – are they tidy? If not, what should be done to make them tidy?

Discussion

- Tidy data may not be the most efficient way to store data.

- For multidimensional analysis, the definition of tidy data may need to be revised.

- Restructuring is only part of the problem – how can other areas of data cleaning be improved?