Introduction to the Julia programming language

14 Data Frames

In [1]:
using CSV, DataFrames
ENV["DATAFRAMES_ROWS"] = 6;
In [2]:
magic_data = CSV.read(joinpath("data", "magic04_data.txt"), DataFrame)
19020×11 DataFrame
19014 rows omitted
  Row  fLength  fWidth   fSize    fConc    fConc1   fAsym     fM3Long   fM3Trans  fAlpha   fDist    class
       Float64  Float64  Float64  Float64  Float64  Float64   Float64   Float64   Float64  Float64  String1
    1  28.7967  16.0021  2.6449   0.3918   0.1982   27.7004   22.011    -8.2027   40.092   81.8828  g
    2  31.6036  11.7235  2.5185   0.5303   0.3773   26.2722   23.8238   -9.9574   6.3609   205.261  g
    3  162.052  136.031  4.0612   0.0374   0.0187   116.741   -64.858   -45.216   76.96    256.788  g
    ⋮  ⋮        ⋮        ⋮        ⋮        ⋮        ⋮         ⋮         ⋮         ⋮        ⋮        ⋮
19018  75.4455  47.5305  3.4483   0.1417   0.0549   -9.3561   41.0562   -9.4662   30.2987  256.517  h
19019  120.513  76.9018  3.9939   0.0944   0.0683   5.8043    -93.5224  -63.8389  84.6874  408.317  h
19020  187.181  53.0014  3.2093   0.2876   0.1539   -167.312  -168.456  31.4755   52.731   272.317  h
In [20]:
names(magic_data)
11-element Vector{String}:
 "fLength"
 "fWidth"
 "fSize"
 "fConc"
 "fConc1"
 "fAsym"
 "fM3Long"
 "fM3Trans"
 "fAlpha"
 "fDist"
 "class"

Accessing Data

The template for accessing data from a DataFrame is:

my_data[selected_rows, selected_columns]

There are a few different patterns for this, but the template is always the same.
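As a minimal sketch of the most common patterns, the following uses a small hypothetical table (the name df and its columns a, b, c are made up for illustration and are not part of the MAGIC data):

```julia
using DataFrames

# Hypothetical toy table standing in for magic_data
df = DataFrame(a = [1, 2, 3, 4],
               b = [10.0, 20.0, 30.0, 40.0],
               c = ["w", "x", "y", "z"])

cell    = df[2, :b]                # one row, one column: a scalar
column  = df[:, :b]                # all rows, one column: a Vector
block   = df[1:2, [:a, :c]]        # row range, list of columns: a DataFrame
by_name = df[[1, 4], ["a", "b"]]   # explicit row indices, columns named by strings
```

In every case the first index selects rows and the second selects columns; only the kind of selector changes.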

Extracting data (without copying) works like this:

In [3]:
magic_data[!, [:fSize]]
19020×1 DataFrame
19014 rows omitted
  Row  fSize
       Float64
    1  2.6449
    2  2.5185
    3  4.0612
    ⋮  ⋮
19018  3.4483
19019  3.9939
19020  3.2093

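This is the recommended way to extract columns as a DataFrame without copying. magic_data.fSize and magic_data[!, "fSize"] also avoid the copy, but they return the column as a Vector rather than as a one-column DataFrame.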
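The difference between a non-copying and a copying selection can be checked with the identity operator ===. This is a small sketch on a hypothetical one-column table (df, alias and copied are illustrative names):

```julia
using DataFrames

df = DataFrame(x = [1.0, 2.0, 3.0])   # hypothetical toy table

alias  = df[!, :x]   # the stored column itself, no copy
copied = df[:, :x]   # a freshly allocated copy

same_object  = alias === df.x    # true: both names refer to one vector
other_object = copied === df.x   # false: the copy is a separate vector

alias[1]  = 100.0    # writes through to df
copied[2] = -1.0     # leaves df untouched
```

After these assignments df.x holds 100.0 in its first entry and is otherwise unchanged, which is why ! should be used with care when the original data must stay intact.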

Copying Data

If the rows are selected with : or with explicit indices such as the range 1:5, rather than with !, then a copy of the data is made:

In [8]:
# Select the given columns from rows 1 to 5
mini_magic_data = magic_data[1:5, [:fLength, :fWidth, :fSize]]
5×3 DataFrame
Row  fLength  fWidth   fSize
     Float64  Float64  Float64
  1  28.7967  16.0021  2.6449
  2  31.6036  11.7235  2.5185
  3  162.052  136.031  4.0612
  4  23.8172  9.5728   2.3385
  5  75.1362  30.9205  3.1611

A vector with one value per selected column can be assigned to any row of the data frame:

In [9]:
mini_magic_data[3, 1:3] = [160., 136., 4.]
mini_magic_data
5×3 DataFrame
Row  fLength  fWidth   fSize
     Float64  Float64  Float64
  1  28.7967  16.0021  2.6449
  2  31.6036  11.7235  2.5185
  3  160.0    136.0    4.0
  4  23.8172  9.5728   2.3385
  5  75.1362  30.9205  3.1611
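Because the row-range selection made a copy, assigning to the copy does not alter the data frame it came from. A short sketch with hypothetical tables named parent and child illustrates this:

```julia
using DataFrames

parent = DataFrame(a = [1.0, 2.0, 3.0], b = [4.0, 5.0, 6.0])  # hypothetical toy table
child  = parent[1:2, [:a, :b]]   # row range, so the data is copied

child[2, :] = [20.0, 50.0]       # overwrite the whole second row of the copy
```

Here child now holds 20.0 and 50.0 in its second row, while parent still contains its original values.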

Selection from bool

A powerful way to select data is to index the rows with a boolean vector constructed from the data frame itself. For example, to select all rows that are signal events (class "g"), do the following.

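(The comparison has to be broadcast with .== because magic_data.class == "g" would compare the whole column with a single string and return one false, not one value per row.)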

In [6]:
magic_data[magic_data.class .== "g", [:fLength, :fWidth, :fSize]]
12332×3 DataFrame
12326 rows omitted
  Row  fLength  fWidth   fSize
       Float64  Float64  Float64
    1  28.7967  16.0021  2.6449
    2  31.6036  11.7235  2.5185
    3  162.052  136.031  4.0612
    ⋮  ⋮        ⋮        ⋮
12330  22.0913  10.8949  2.2945
12331  56.2216  18.7019  2.9297
12332  31.5125  19.2867  2.9578
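The role of the dot in .== can be seen on a small hypothetical table (the values are made up; only the column names echo the MAGIC data):

```julia
using DataFrames

df = DataFrame(class = ["g", "h", "g"], fSize = [2.6, 3.4, 2.5])  # hypothetical toy table

whole = df.class == "g"     # compares the vector as a whole with one string
mask  = df.class .== "g"    # elementwise comparison: one Bool per row

signal = df[mask, :]        # keep only the rows where mask is true
```

The unbroadcast comparison yields a single false, which is not usable as a row selector, whereas the broadcast one yields a boolean vector with the same length as the table.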

Broadcast Assignment

To apply an operation to every element of a column and store the result, we use Julia's broadcast assignment operators, such as .= or, as here, .*=:

In [10]:
mini_magic_data[!, :fSize] .*= 1000
mini_magic_data
5×3 DataFrame
Row  fLength  fWidth   fSize
     Float64  Float64  Float64
  1  28.7967  16.0021  2644.9
  2  31.6036  11.7235  2518.5
  3  160.0    136.0    4000.0
  4  23.8172  9.5728   2338.5
  5  75.1362  30.9205  3161.1
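A few variants of broadcast assignment, sketched on a hypothetical table (the column names x, y and z are illustrative only):

```julia
using DataFrames

df = DataFrame(x = [1.0, 2.0, 3.0], y = [10.0, 20.0, 30.0])  # hypothetical toy table

df[!, :x] .*= 1000       # shorthand for df[!, :x] .= df[!, :x] .* 1000
df[:, :y] .= 0.0         # fill an existing column in place
df[!, :z] .= "unset"     # broadcast a scalar into a new column
```

With ! the column is replaced (or created), while with : the values are written into the existing column, so the element type must be compatible.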

Adding New Data

Adding new data to a data frame is just a matter of assigning to a new column (naming the column with a Julia symbol is convenient):

In [26]:
mini_magic_data[:, :name] = ["alice", "bob", "ciarn", "dinah", "elmer"]
mini_magic_data
5×4 DataFrame
Row  fLength  fWidth   fSize    name
     Float64  Float64  Float64  String
  1  28.7967  16.0021  2644.9   alice
  2  31.6036  11.7235  2518.5   bob
  3  160.0    136.0    4000.0   ciarn
  4  23.8172  9.5728   2338.5   dinah
  5  75.1362  30.9205  3161.1   elmer
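There are several equivalent spellings for adding a column. The sketch below shows some of them on a hypothetical table; the column names are invented for the example:

```julia
using DataFrames

df = DataFrame(x = [1, 2, 3])   # hypothetical toy table

df.label = ["a", "b", "c"]             # property syntax
df[!, :double] = 2 .* df.x             # stores the given vector as-is
df[:, "half"] = df.x ./ 2              # stores a copy; string names work too
insertcols!(df, 1, :id => 101:103)     # insert at a chosen position (here: first)
```

Plain assignment appends the new column at the end, whereas insertcols! lets the position be chosen.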

Event selection

The first thing we might want to do is select events that match particular criteria. For that we can use the subset function.

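Usually one would not bother with a named function for this kind of trivial selection, so we use an anonymous function. (The outputs below already include the fArea column that is added in the Derived Data section, because these cells were executed after it.)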

In [34]:
subset(mini_magic_data, [:fLength, :fWidth] => (l, w) -> (l .>= 30) .&& (w .> 10))
3×5 DataFrame
Row  fLength  fWidth   fSize    name    fArea
     Float64  Float64  Float64  String  Float64
  1  31.6036  11.7235  2518.5   bob     370.505
  2  160.0    136.0    4000.0   ciarn   21760.0
  3  75.1362  30.9205  3161.1   elmer   2323.25

The same selection can of course also be made by indexing the rows with a boolean vector:

In [43]:
mini_magic_data[(mini_magic_data.fLength .>= 30) .&& (mini_magic_data.fWidth .> 10), :]
3×5 DataFrame
Row  fLength  fWidth   fSize    name    fArea
     Float64  Float64  Float64  String  Float64
  1  31.6036  11.7235  2518.5   bob     370.505
  2  160.0    136.0    4000.0   ciarn   21760.0
  3  75.1362  30.9205  3161.1   elmer   2323.25
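The selections above pass whole columns to the anonymous function, which is why the comparisons are broadcast. As a sketch of two alternatives, ByRow and filter both apply a scalar predicate row by row; the table and its values here are hypothetical:

```julia
using DataFrames

# Hypothetical toy values; only the column names echo the MAGIC data
df = DataFrame(fLength = [28.8, 31.6, 160.0], fWidth = [16.0, 11.7, 136.0])

# subset with a column-wise function: comparisons must be broadcast
a = subset(df, [:fLength, :fWidth] => (l, w) -> (l .>= 30) .& (w .> 10))

# subset with ByRow: the function sees one value per column at a time
b = subset(df, [:fLength, :fWidth] => ByRow((l, w) -> l >= 30 && w > 10))

# filter: the predicate comes first and is applied row by row
c = filter([:fLength, :fWidth] => (l, w) -> l >= 30 && w > 10, df)
```

All three return the same two rows; which form reads best is largely a matter of taste.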

Derived Data

For some analyses it is useful to add derived values as new columns. Here transform! with ByRow computes one value per row and stores it in the new column fArea:

In [29]:
transform!(mini_magic_data, [:fLength, :fWidth] => ByRow((l, w) -> l * w) => :fArea)
5×5 DataFrame
Row  fLength  fWidth   fSize    name    fArea
     Float64  Float64  Float64  String  Float64
  1  28.7967  16.0021  2644.9   alice   460.808
  2  31.6036  11.7235  2518.5   bob     370.505
  3  160.0    136.0    4000.0   ciarn   21760.0
  4  23.8172  9.5728   2338.5   dinah   227.997
  5  75.1362  30.9205  3161.1   elmer   2323.25
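The call above modifies the data frame in place. As a sketch of the related non-mutating forms, on a hypothetical table with made-up values (fLengthNorm is an invented column name):

```julia
using DataFrames

df = DataFrame(fLength = [2.0, 3.0], fWidth = [5.0, 7.0])   # hypothetical toy table

# source columns => function => target column; ByRow(*) multiplies row by row
area = [:fLength, :fWidth] => ByRow(*) => :fArea

with_area = transform(df, area)   # new DataFrame: all columns plus fArea
only_area = select(df, area)      # new DataFrame: just fArea

# Without ByRow the function receives the whole column, e.g. to normalise it
transform!(df, :fLength => (l -> l ./ maximum(l)) => :fLengthNorm)
```

Functions ending in ! change df itself; the versions without ! leave it untouched and return a new data frame.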

Map data to different values

In [11]:
magic_data.class = (l -> l == "g" ? 1 : 0).(magic_data.class)
magic_data
19020×11 DataFrame
19014 rows omitted
  Row  fLength  fWidth   fSize    fConc    fConc1   fAsym     fM3Long   fM3Trans  fAlpha   fDist    class
       Float64  Float64  Float64  Float64  Float64  Float64   Float64   Float64   Float64  Float64  Int64
    1  28.7967  16.0021  2.6449   0.3918   0.1982   27.7004   22.011    -8.2027   40.092   81.8828  1
    2  31.6036  11.7235  2.5185   0.5303   0.3773   26.2722   23.8238   -9.9574   6.3609   205.261  1
    3  162.052  136.031  4.0612   0.0374   0.0187   116.741   -64.858   -45.216   76.96    256.788  1
    ⋮  ⋮        ⋮        ⋮        ⋮        ⋮        ⋮         ⋮         ⋮         ⋮        ⋮        ⋮
19018  75.4455  47.5305  3.4483   0.1417   0.0549   -9.3561   41.0562   -9.4662   30.2987  256.517  0
19019  120.513  76.9018  3.9939   0.0944   0.0683   5.8043    -93.5224  -63.8389  84.6874  408.317  0
19020  187.181  53.0014  3.2093   0.2876   0.1539   -167.312  -168.456  31.4755   52.731   272.317  0
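The broadcast anonymous function used above is one way to recode a column. Two alternatives are sketched here on a hypothetical table; the codes dictionary and the new column names label and flag are assumptions made for the example:

```julia
using DataFrames

df = DataFrame(class = ["g", "h", "g"])     # hypothetical toy table

codes = Dict("g" => 1, "h" => 0)            # assumed mapping from class to label
df.label = [codes[c] for c in df.class]     # lookup; throws on an unknown class
df.flag  = ifelse.(df.class .== "g", 1, 0)  # broadcast form of the ternary
```

Writing the result to a new column rather than overwriting class keeps the cell safe to run twice: once class holds integers, comparing it with "g" again would map every row to 0.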

Transform, Select, Combine, GroupBy, Filter

As a short summary of the data frame manipulation functions we have met:

Function    Description
transform   Apply a transformation to one or more columns; return all original columns plus any new ones
select      Apply a transformation to one or more columns; return only the selected columns, in the order requested
combine     Apply a transformation and return only its results, collapsed to as many rows as the transformation produces (e.g. one row per group for an aggregation)
groupby     Split a data frame into groups according to the values of one or more columns
filter      Apply a row selection to a data frame; the predicate comes first, filter(predicate, df), following the convention of Base.filter

Together, groupby and combine allow us to manipulate data powerfully in Julia using the well-known Split-Apply-Combine strategy, originally introduced for S.
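A minimal sketch of Split-Apply-Combine, assuming a hypothetical table with a class column and one numeric column (the values and the output column names n and fSize_mean are made up):

```julia
using DataFrames, Statistics

# Hypothetical toy values; only the column names echo the MAGIC data
df = DataFrame(class = ["g", "h", "g", "h"], fSize = [2.0, 3.0, 4.0, 5.0])

groups = groupby(df, :class; sort = true)     # split: one group per class value

per_class = combine(groups,                   # apply to each group, then combine
                    nrow => :n,               # number of rows in the group
                    :fSize => mean => :fSize_mean)
```

The result has one row per class, holding the group size and the mean fSize. sort = true is used here only to make the order of the groups predictable.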

Visualization

In [31]:
using Plots

# Split the data once into signal (class == 1) and background (class == 0)
magic_data_signal = filter(:class => l -> l == 1, magic_data)
magic_data_backgr = filter(:class => l -> l == 0, magic_data)

# One subplot per feature column (every column except the trailing class)
layout = @layout [a b c d e; f g h i j]
p = plot(layout = layout, legend = :topright, size = (800, 300))

for (i, col) in enumerate(names(magic_data)[1:end-1])
    stephist!(p, magic_data_signal[!, col], title = col, normalize = true, subplot = i, label = "signal")
    stephist!(p, magic_data_backgr[!, col], normalize = true, subplot = i, label = "background")
end

p