Quickstart

In this first tutorial, we load the petrophysical properties dataset and use matplotlib to visualize the data space. Start by importing the numpy, pandas and matplotlib libraries.

[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Importing data

[4]:
url = 'https://gist.githubusercontent.com/clberube/786a5774acfc3137039288d1a6d84fc0/raw/dc59ee0104b02141a9ca000ef52faf3d854387a1/MalarticRockPhysics.csv'
df = pd.read_csv(url)

First let’s inspect some basic properties of this simple data file, like its shape (n_rows, n_columns), its index and its column names.

[10]:
print(df.shape)
(845, 3)
[12]:
df.head()
[12]:
                            Lithology  RockDensity  log10_MagSusceptibility
0               Meta-sedimentary_rock        2.755                -3.573489
1               Meta-sedimentary_rock        2.742                -3.131356
2               Meta-sedimentary_rock        2.829                -2.058489
3  Felsic-intermediate_intrusive_rock        2.674                -4.359519
4               Meta-sedimentary_rock        2.786                -3.434152

The dataset contains 845 rows and 3 columns. The first column, Lithology, describes each sample’s rock type. The second column, RockDensity, is the skeletal density of the rock samples, expressed in g/cm³. The third column, log10_MagSusceptibility, is the base-10 logarithm of their magnetic susceptibility (measured in SI units). Each rock sample is labeled with an index number (from 0 to 844).
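The index and column names mentioned earlier can be inspected in the same way as the shape. The sketch below is only an illustration: to stay runnable without network access, it rebuilds a small stand-in DataFrame (the name `sample` is an assumption, not part of the tutorial) from the five rows printed by df.head(). On the full dataset, the same attributes would be read from `df`.

```python
import pandas as pd

# Stand-in for df (assumption): only the five rows printed by df.head()
sample = pd.DataFrame({
    'Lithology': ['Meta-sedimentary_rock',
                  'Meta-sedimentary_rock',
                  'Meta-sedimentary_rock',
                  'Felsic-intermediate_intrusive_rock',
                  'Meta-sedimentary_rock'],
    'RockDensity': [2.755, 2.742, 2.829, 2.674, 2.786],
    'log10_MagSusceptibility': [-3.573489, -3.131356, -2.058489,
                                -4.359519, -3.434152],
})

print(sample.index)             # row labels (a default integer range here)
print(sample.columns.tolist())  # column names as a plain list
print(sample.dtypes)            # data type stored in each column
```

The Lithology column holds text, so pandas reports it with a generic object or string dtype depending on the pandas version, while the two numeric columns are stored as floats.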

Data distributions

This dataset contains three features, or properties, which have been observed on each rock sample. The first feature is the Lithology column. It contains text values that correspond to categories of rock types. Let’s find out how each category of rock type is represented in the dataset.

[15]:
df['Lithology'].value_counts()
[15]:
Meta-sedimentary_rock                 583
Mafic_dyke                            140
Felsic-intermediate_intrusive_rock    122
Name: Lithology, dtype: int64
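
The counts above can also be read as fractions of the dataset, which makes the class imbalance easier to see. This is a minimal sketch that re-enters the three counts by hand (the names `counts` and `fractions` are illustrative); on the full DataFrame, `df['Lithology'].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({
    'Meta-sedimentary_rock': 583,
    'Mafic_dyke': 140,
    'Felsic-intermediate_intrusive_rock': 122,
})

# Divide each count by the total number of samples
fractions = counts / counts.sum()
print(fractions.round(3))
```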

The Meta-sedimentary_rock category is the most common one, with 583 rock samples in it. Other rock types include Mafic_dyke and Felsic-intermediate_intrusive_rock. The second feature is RockDensity. This is a continuous variable, so we can visualize its distribution using a histogram.

[98]:
df.hist(column='RockDensity', bins=25, grid=False)
[Figure: histogram of RockDensity, 25 bins]

We can also visualize the rock density distribution after grouping the dataset by rock type.

[52]:
df.hist(column='RockDensity', by='Lithology')
[Figure: histograms of RockDensity, one panel per lithology]
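
Grouped histograms can be complemented with a numerical summary per rock type. The sketch below applies groupby and agg to a five-row stand-in (the name `sample` and its contents are an assumption, taken from the df.head() output), so the numbers it prints describe those five rows only, not the full dataset. The same call on `df` would summarize all 845 samples.

```python
import pandas as pd

# Stand-in for df (assumption): the five rows printed by df.head()
sample = pd.DataFrame({
    'Lithology': ['Meta-sedimentary_rock',
                  'Meta-sedimentary_rock',
                  'Meta-sedimentary_rock',
                  'Felsic-intermediate_intrusive_rock',
                  'Meta-sedimentary_rock'],
    'RockDensity': [2.755, 2.742, 2.829, 2.674, 2.786],
})

# One row of summary statistics per lithology
stats = sample.groupby('Lithology')['RockDensity'].agg(
    ['count', 'mean', 'min', 'max'])
print(stats)
```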

The third feature, log10_MagSusceptibility, is log-transformed. This is a useful data transformation when the values of a variable vary over several orders of magnitude. Let’s visualize the distribution of the rock samples’ magnetic susceptibility to understand what happens when this property is log-transformed.

[99]:
log_ms = df['log10_MagSusceptibility']

fig, ax = plt.subplots(1, 2, figsize=(8, 3))
ax[0].hist(10**log_ms)
ax[0].set_title('raw values')
ax[1].hist(log_ms)
ax[1].set_title('log-transformed')
[Figure: histograms of magnetic susceptibility; left panel "raw values", right panel "log-transformed"]

The raw values are concentrated near zero with a long tail toward high susceptibilities, whereas the log-transformed values are spread more evenly across their range.
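
To make the effect of the transform concrete, the sketch below converts a few log-transformed values back to raw susceptibilities and compares their ranges. The five values are copied from the df.head() output, and the variable names are illustrative; the full column would be used the same way.

```python
import numpy as np

# Five log10 susceptibility values copied from the df.head() output
log_ms = np.array([-3.573489, -3.131356, -2.058489, -4.359519, -3.434152])

# Undo the transform to recover the raw SI values
raw_ms = 10**log_ms

# Width of the range in log units, i.e. in orders of magnitude
decades = log_ms.max() - log_ms.min()

# Ratio between the largest and smallest raw values
ratio = raw_ms.max() / raw_ms.min()

print(f'range in log10 units: {decades:.2f}')
print(f'max/min ratio of raw values: {ratio:.0f}')

# Applying log10 to the raw values recovers the original column
assert np.allclose(np.log10(raw_ms), log_ms)
```

A difference of one unit on the log scale corresponds to a factor of ten in the raw values, which is why the raw histogram is dominated by its few largest samples.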

The feature space

The feature space, or data space, is the 2D space in which our RockDensity and log10_MagSusceptibility measurements lie. It is straightforward to visualize this space with a simple scatter plot.

[79]:
fig, ax = plt.subplots()
ax.scatter(df['RockDensity'], df['log10_MagSusceptibility'], marker='.')
ax.set_xlabel('Rock Density (g/cm$^3$)')
ax.set_ylabel(r'Magnetic susceptibility ($\log_{10}$ SI)')  # raw string keeps the LaTeX backslash
[Figure: scatter plot of log10 magnetic susceptibility against rock density]

Even better, we could use a color code to identify each point in the scatter plot by its lithology.

[101]:
fig, ax = plt.subplots()
groups = df.groupby('Lithology')
for name, group in groups:
    ax.scatter(group['RockDensity'],
               group['log10_MagSusceptibility'],
               marker='.',
               label=name)
ax.legend()
ax.set_xlabel('Rock Density (g/cm$^3$)')
ax.set_ylabel(r'Magnetic susceptibility ($\log_{10}$ SI)')  # raw string keeps the LaTeX backslash
[Figure: scatter plot of log10 magnetic susceptibility against rock density, colored by lithology]

It is clear from the previous figure that the various rock types have contrasting physical properties. Navigate through the various tutorials on this site to further explore the relationships between lithology, magnetic susceptibility and rock density.