Scientific Computing 13 - Advanced Seaborn

2018-03-15
Geng

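seaborn is built on top of Matplotlib and has both low-level and high-level functions of its own. The low-level functions use Matplotlib directly to plot numeric data, categorical data, and regressions. The high-level functions call the low-level ones and let you produce plots more quickly. If you want more control over a plot, use the low-level functions. The most control of all comes from using Matplotlib directly, but that takes more work.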

Visualizing the distribution of a dataset

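1. Univariate: distplot draws a histogram and a probability density curve
2. Bivariate: jointplot draws the joint distribution
3. Pairwise relationships: pairplot draws the pairwise relationships in a dataset

import numpy as np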
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(sum(map(ord, "distributions")))
sns.set(color_codes=True)


Plotting univariate distributions

x = np.random.normal(size=100)
sns.distplot(x);


Histograms

sns.distplot(x, kde=False, rug=True);


Kernel density estimation (KDE)

sns.distplot(x, hist=False, rug=True);


Plotting bivariate distributions

mean, cov = [0, 1], [(1, .5), (.5, 1)]
data = np.random.multivariate_normal(mean, cov, 200)
df = pd.DataFrame(data, columns=["x", "y"])

sns.jointplot(x="x", y="y", data=df)

Visualizing pairwise relationships in a dataset

iris = sns.load_dataset("iris")
sns.pairplot(iris)

Plotting with categorical data

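• Show every observation at each level of the categorical variable: swarmplot(), stripplot()
• Show an abstract representation of each distribution of observations: boxplot(), violinplot()
• Apply a statistical estimate to show a measure of central tendency and a confidence interval: barplot(), pointplot()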

sns.set(style="whitegrid", color_codes=True)
np.random.seed(sum(map(ord, "categorical")))


Categorical scatterplots

tips = sns.load_dataset("tips")  # load the example dataset used throughout this section
sns.stripplot(x="day", y="total_bill", data=tips);


sns.stripplot(x="day", y="total_bill", data=tips, jitter=True);


sns.swarmplot(x="day", y="total_bill", data=tips);


sns.swarmplot(x="day", y="total_bill", hue="sex", data=tips);


sns.swarmplot(x="total_bill", y="day", hue="time", data=tips);


Distributions of observations within categories

Boxplots

sns.boxplot(x="day", y="total_bill", hue="time", data=tips);


sns.boxplot(x="day", y="total_bill", data=tips);


Violinplots

sns.violinplot(x="total_bill", y="day", hue="time", data=tips);


When the hue parameter has only two levels, you can also "split" the violins, which uses the space more efficiently:

sns.violinplot(x="day", y="total_bill", hue="sex", data=tips, split=True);


Statistical estimation within categories

Bar plots

titanic = sns.load_dataset("titanic")  # load the example dataset used in this section
sns.barplot(x="sex", y="survived", hue="class", data=titanic);


sns.countplot(x="deck", data=titanic, palette="Greens_d");


sns.countplot(y="deck", hue="class", data=titanic);


Point plots

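The pointplot() function offers another way to view the same information. It also encodes the estimate as height on the y axis, but instead of drawing a full bar it plots the point estimate and its confidence interval. In addition, pointplot() connects points that share the same hue level, which makes it easy to see how y changes with x.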

sns.pointplot(x="sex", y="survived", hue="class", data=titanic);
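
To make "point estimate and confidence interval" concrete, here is a minimal NumPy sketch of the two quantities for a single group: the mean and a percentile-bootstrap interval around it. The outcome values, the helper name bootstrap_ci, the 95% level, and the 1000 resamples are all assumptions made for illustration; seaborn's own procedure may differ in detail.

```python
import numpy as np

def bootstrap_ci(values, n_boot=1000, level=95, seed=0):
    """Return (mean, low, high) using a percentile bootstrap of the mean."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    tail = (100 - level) / 2
    low, high = np.percentile(boot_means, [tail, 100 - tail])
    return values.mean(), low, high

# Hypothetical 0/1 outcomes for one group (for example "survived" in one class).
outcomes = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1])
mean, low, high = bootstrap_ci(outcomes)
print(f"estimate={mean:.2f}, interval=({low:.2f}, {high:.2f})")
```

The point in a point plot corresponds to the estimate and the vertical bar to the interval; smaller groups give wider intervals.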


Drawing multi-panel categorical plots

factorplot and FacetGrid

One of seaborn's strengths is how easily it draws conditional plots.

sns.factorplot(x="day", y="total_bill", hue="smoker", data=tips);


The kind parameter lets you choose any of the plot types discussed above:

sns.factorplot(x="day", y="total_bill", hue="smoker", data=tips, kind="bar");


sns.factorplot(x="day", y="total_bill", hue="smoker",
               col="time", data=tips, kind="swarm");  # one column per level of time


sns.factorplot(x="day", y="total_bill", hue="smoker",
               row="time", data=tips, kind="swarm");  # one row per level of time


sns.factorplot(x="time", y="total_bill", hue="smoker",
               col="day", data=tips, kind="box", size=4, aspect=.5)

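A big advantage of factorplot is that it does all the work in one call, so you do not have to split the data and build each conditional plot yourself.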
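
For comparison, the following sketch shows roughly what "splitting the data yourself" looks like with only pandas and Matplotlib. The DataFrame is a hypothetical stand-in for the tips data, and the bar-of-means panel is just one possible choice of plot.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the tips data (values are made up).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "time": np.tile(["Lunch", "Dinner"], 30),
    "day": np.tile(["Thur", "Fri", "Sat"], 20),
    "total_bill": rng.normal(20, 5, size=60),
})

# One panel per level of "time": this is the bookkeeping that col="time" automates.
levels = sorted(df["time"].unique())
fig, axes = plt.subplots(1, len(levels), sharey=True, figsize=(8, 3))
for ax, level in zip(axes, levels):
    subset = df[df["time"] == level]                       # split the data by hand
    means = subset.groupby("day")["total_bill"].mean()     # estimate per category
    ax.bar(means.index, means.values)
    ax.set_title(f"time = {level}")
```

Every extra conditioning variable adds another loop and more layout code, which is the work the single call takes over.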

The FacetGrid object is slightly more complicated but also more powerful, and it follows the same idea. Suppose we want to look at KDE plots:

g = sns.FacetGrid(data=tips, col="day")  # create a FacetGrid with one column per level of day
g.map(sns.distplot, "total_bill")  # map(plotting function, column name) draws the plot in every facet

titanic.head()

   survived  pclass     sex   age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone
0         0       3    male  22.0      1      0   7.2500        S  Third    man        True  NaN  Southampton    no  False
1         1       1  female  38.0      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False
2         1       3  female  26.0      0      0   7.9250        S  Third  woman       False  NaN  Southampton   yes   True
3         1       1  female  35.0      1      0  53.1000        S  First  woman       False    C  Southampton   yes  False
4         0       3    male  35.0      0      0   8.0500        S  Third    man        True  NaN  Southampton    no   True

g = sns.FacetGrid(data=titanic, col="sex")
g.map(plt.scatter, "age", "fare")

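FacetGrid lets us map any plotting function onto each subset of the data. Above, for example, we passed plt.scatter to g.map, which tells seaborn to apply matplotlib's plt.scatter function to each subset. We can use any function that understands the input data. For example, we can draw regression plots: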

g = sns.FacetGrid(titanic, col="sex")
g.map(sns.regplot, "age", "fare")

g = sns.FacetGrid(titanic, col="sex", row="survived")
g.map(sns.kdeplot, "age", "fare")

pairplot and PairGrid

PairGrid helps show the relationships between several variables across categories, and it is used much like FacetGrid:

g = sns.PairGrid(tips,
                 x_vars=["smoker", "time", "sex"],
                 y_vars=["total_bill", "tip"],
                 aspect=.75, size=3.5)
g.map(sns.violinplot, palette="pastel");


g = sns.pairplot(tips,
                 x_vars=["smoker", "time", "sex"],
                 y_vars=["total_bill", "tip"],
                 aspect=.75, size=3.5)
g.map(sns.violinplot, palette="pastel");


Visualizing linear relationships

sns.set(color_codes=True)
np.random.seed(sum(map(ord, "regression")))


Functions to draw linear regression models

seaborn has two main functions for showing a linear relationship through regression: regplot() and lmplot(). The two are closely related.

sns.regplot(x="total_bill", y="tip", data=tips);


sns.lmplot(x="total_bill", y="tip", data=tips)


sns.lmplot(x="size", y="tip", data=tips)


sns.lmplot(x="size", y="tip", data=tips, x_jitter=.05)


sns.lmplot(x="size", y="tip", data=tips, x_estimator=np.mean)


Fitting different kinds of models

anscombe = sns.load_dataset("anscombe")

sns.lmplot(x="x", y="y", data=anscombe[anscombe['dataset']=='I'], ci=None, scatter_kws={"s": 80})

sns.lmplot(x="x", y="y", data=anscombe[anscombe['dataset']=='II'], ci=None, scatter_kws={"s": 80})


sns.lmplot(x="x", y="y", data=anscombe[anscombe['dataset']=='II'], order=2, ci=None, scatter_kws={"s": 80})


sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'I'"),
           ci=None, scatter_kws={"s": 80})

sns.lmplot(x="x", y="y", data=anscombe[anscombe['dataset']=='III'], ci=None, scatter_kws={"s": 80})


sns.lmplot(x="x", y="y", data=anscombe[anscombe['dataset']=='III'], robust=True, ci=None, scatter_kws={"s": 80})


When y is a binary variable, simple linear regression also appears to work, but can you trust the result?

tips["big_tip"] = (tips.tip / tips.total_bill) > .15
sns.lmplot(x="total_bill", y="big_tip", data=tips,
           y_jitter=.03)


sns.lmplot(x="total_bill", y="big_tip", data=tips,
           logistic=True, y_jitter=.03)
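
One way to see why the straight line is hard to trust for a binary y is to compare predictions just outside the observed range. The sketch below is an illustration only: the data are simulated, the "true" probability curve is an assumption, and a few Newton steps in plain NumPy stand in for whatever fitting routine the plotting function uses.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
p_true = 1 / (1 + np.exp(-(x - 5)))              # assumed true probability of y == 1
y = (rng.uniform(size=200) < p_true).astype(float)

# Ordinary least squares line through the 0/1 outcomes.
slope, intercept = np.polyfit(x, y, deg=1)

# Logistic regression fitted with a few Newton (IRLS) steps.
X = np.column_stack([np.ones_like(x), x])
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    gradient = X.T @ (y - p)
    hessian = X.T @ (X * (p * (1 - p))[:, None])
    beta = beta + np.linalg.solve(hessian, gradient)

x_new = np.array([-2.0, 12.0])                   # slightly outside the observed range
linear_pred = intercept + slope * x_new
logistic_pred = 1 / (1 + np.exp(-(beta[0] + beta[1] * x_new)))
print("linear:  ", linear_pred)
print("logistic:", logistic_pred)
```

The linear predictions leave the interval from 0 to 1, so they cannot be read as probabilities, while the logistic curve stays inside it by construction.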

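The residplot() function checks whether a simple regression model suits a dataset. It fits and removes a simple linear regression, then plots the residual for each observation. Ideally, these values are scattered randomly around y = 0: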

sns.residplot(x="x", y="y", data=anscombe.query("dataset == 'I'"),
              scatter_kws={"s": 80});


sns.residplot(x="x", y="y", data=anscombe[anscombe['dataset'] == 'I'], scatter_kws={"s": 80})

sns.residplot(x="x", y="y", data=anscombe[anscombe['dataset'] == 'II'], scatter_kws={"s": 80})
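
The "fit, then remove" step can be reproduced by hand. The sketch below uses NumPy only, with the values of Anscombe's dataset II typed in directly; the helper name residuals is made up for this example. It compares the residuals of a straight-line fit with those of a second-order fit, which is the same contrast as order=2 in the lmplot() call earlier.

```python
import numpy as np

# Anscombe's dataset II, entered by hand.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

def residuals(x, y, order):
    """Fit a polynomial of the given order and return y minus the fitted values."""
    coeffs = np.polyfit(x, y, deg=order)
    return y - np.polyval(coeffs, x)

linear_resid = residuals(x, y, order=1)
quadratic_resid = residuals(x, y, order=2)

print("largest |residual|, linear fit:   ", np.abs(linear_resid).max())
print("largest |residual|, quadratic fit:", np.abs(quadratic_resid).max())
```

The linear residuals are large and follow a curve rather than scattering randomly around zero, which is the structure the residual plot for dataset II reveals.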

Conditioning on other variables

sns.lmplot(x="total_bill", y="tip", hue="smoker", data=tips)


sns.lmplot(x="total_bill", y="tip", hue="smoker", col="time", data=tips)

sns.lmplot(x="total_bill", y="tip", hue="smoker", col="time", row="sex", data=tips)

Plotting a regression in other contexts

sns.jointplot(x="total_bill", y="tip", data=tips, kind="reg")

Passing kind="reg" to pairplot() combines regplot() with PairGrid to show the linear relationships in a dataset.

sns.pairplot(tips, x_vars=["total_bill", "size"], y_vars=["tip"], size=5, aspect=.8, kind="reg")

sns.pairplot(tips, x_vars=["total_bill", "size"], y_vars=["tip"], hue="smoker", size=5, aspect=.8, kind="reg")

Plotting on data-aware grids

For advanced use, you can use the objects discussed in this part of the tutorial directly, which will provide maximum flexibility. Some seaborn functions (such as lmplot(), factorplot(), and pairplot()) also use them behind the scenes. Unlike other seaborn functions that are “Axes-level” and draw onto specific (possibly already-existing) matplotlib Axes without otherwise manipulating the figure, these higher-level functions create a figure when called and are generally more strict about how it gets set up. In some cases, arguments either to those functions or to the constructor of the class they rely on will provide a different interface to attributes like the figure size, as in the case of lmplot() where you can set the height and aspect ratio for each facet rather than the overall size of the figure. Any function that uses one of these objects will always return it after plotting, though, and most of these objects have convenience methods for changing how the plot is drawn, often in a more abstract and easy way.
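
The difference between the two kinds of function shows up in what they accept and return. In this sketch (simulated data, arbitrary variable names), regplot() draws onto an Axes we created ourselves, while lmplot() builds its own figure and hands back a grid object.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Simulated data for illustration.
rng = np.random.default_rng(2)
df = pd.DataFrame({"x": rng.normal(size=80)})
df["y"] = 0.5 * df["x"] + rng.normal(scale=0.3, size=80)

# Axes-level: draws onto the Axes we hand over and returns that same Axes.
fig, (left, right) = plt.subplots(1, 2, figsize=(8, 3))
returned_ax = sns.regplot(x="x", y="y", data=df, ax=left)
right.hist(df["y"])  # the other panel stays fully under our control

# Figure-level: creates its own figure and returns a FacetGrid.
grid = sns.lmplot(x="x", y="y", data=df)
grid.set_axis_labels("predictor", "response")  # a convenience method on the grid
```

Axes-level functions are the ones to reach for when a seaborn plot has to sit inside a figure layout you manage yourself.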

sns.set(style="ticks")
np.random.seed(sum(map(ord, "axis_grids")))


The FacetGrid is an object that links a Pandas DataFrame to a matplotlib figure with a particular structure.

The FacetGrid class is useful when you want to visualize the distribution of a variable or the relationship between multiple variables separately within subsets of your dataset. A FacetGrid can be drawn with up to three dimensions: row, col, and hue. The first two have obvious correspondence with the resulting array of axes; think of the hue variable as a third dimension along a depth axis, where different levels are plotted with different colors.

Additionally, both lmplot() and factorplot() use this object internally, and they return the object when they are finished so that it can be used for further tweaking.

The class is used by initializing a FacetGrid object with a dataframe and the names of the variables that will form the row, column, or hue dimensions of the grid. These variables should be categorical or discrete, and then the data at each level of the variable will be used for a facet along that axis. For example, say we wanted to examine differences between lunch and dinner in the tips dataset.

tips = sns.load_dataset("tips")
g = sns.FacetGrid(tips, col="time")


The main approach for visualizing data on this grid is with the FacetGrid.map() method. Provide it with a plotting function and the name(s) of variable(s) in the dataframe to plot. Let’s look at the distribution of tips in each of these subsets, using a histogram.

g = sns.FacetGrid(tips, col="time")
g.map(plt.hist, "tip")

g = sns.FacetGrid(tips, col="sex", hue="smoker")
g.map(plt.scatter, "total_bill", "tip", alpha=.7)

The default ordering of the facets is derived from the information in the DataFrame. If the variable used to define facets has a categorical type, then the order of the categories is used. Otherwise, the facets will be in the order of appearance of the category levels. It is possible, however, to specify an ordering of any facet dimension with the appropriate *_order parameter:

ordered_days = tips.day.value_counts().index
g = sns.FacetGrid(tips, row="day", row_order=ordered_days, size=1.7, aspect=4)
g.map(sns.distplot, "total_bill", hist=False, rug=True)

Plotting pairwise relationships in a dataset

PairGrid also allows you to quickly draw a grid of small subplots using the same plot type to visualize data in each. In a PairGrid, each row and column is assigned to a different variable, so the resulting plot shows each pairwise relationship in the dataset. This style of plot is sometimes called a “scatterplot matrix”, as this is the most common way to show each relationship, but PairGrid is not limited to scatterplots.

It’s important to understand the differences between a FacetGrid and a PairGrid. In the former, each facet shows the same relationship conditioned on different levels of other variables. In the latter, each plot shows a different relationship (although the upper and lower triangles will have mirrored plots). Using PairGrid can give you a very quick, very high-level summary of interesting relationships in your dataset.

The basic usage of the class is very similar to FacetGrid. First you initialize the grid, then you pass a plotting function to a map method and it will be called on each subplot. There is also a companion function, pairplot(), that trades off some flexibility for faster plotting.

g = sns.PairGrid(iris)
g.map(plt.scatter)


It’s possible to plot a different function on the diagonal to show the univariate distribution of the variable in each column. Note that the axis ticks won’t correspond to the count or density axis of this plot, though.

g = sns.PairGrid(iris)
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter)

A very common way to use this plot is to color the observations by a separate categorical variable. For example, the iris dataset has four measurements for each of three different species of iris flowers so you can see how they differ.

g = sns.PairGrid(iris, hue="species")
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter)

By default every numeric column in the dataset is used, but you can focus on particular relationships if you want.

g = sns.PairGrid(iris, vars=["sepal_length", "sepal_width"], hue="species")
g.map(plt.scatter)

It’s also possible to use a different function in the upper and lower triangles to emphasize different aspects of the relationship.

g = sns.PairGrid(iris)
g.map_upper(plt.scatter)
g.map_lower(sns.kdeplot, cmap="Blues_d")
g.map_diag(sns.kdeplot, lw=3, legend=False);

g = sns.PairGrid(tips, y_vars=["tip"], x_vars=["total_bill", "size"], size=4)
g.map(sns.regplot)
g.set(ylim=(-1, 11), yticks=[0, 5, 10])

PairGrid is flexible, but to take a quick look at a dataset, it can be easier to use pairplot(). This function uses scatterplots and histograms by default, although a few other kinds will be added (currently, you can also plot regression plots on the off-diagonals and KDEs on the diagonal).

sns.pairplot(iris, hue="species", size=2.5)

You can also control the aesthetics of the plot with keyword arguments, and it returns the PairGrid instance for further tweaking.

g = sns.pairplot(iris, hue="species", palette="Set2", diag_kind="kde", size=2.5)


Controlling figure aesthetics

Matplotlib is highly customizable, but it can be hard to know what settings to tweak to achieve an attractive plot. Seaborn comes with a number of customized themes and a high-level interface for controlling the look of matplotlib figures.

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
%matplotlib inline
np.random.seed(sum(map(ord, "aesthetics")))


Let’s define a simple function to plot some offset sine waves, which will help us see the different stylistic parameters we can tweak.

def sinplot(flip=1):
    x = np.linspace(0, 14, 100)
    for i in range(1, 7):
        plt.plot(x, np.sin(x + i * .5) * (7 - i) * flip)

sinplot()


# To switch to seaborn defaults, simply call the set() function.
sns.set()
sinplot()


Seaborn splits matplotlib parameters into two independent groups. The first group sets the aesthetic style of the plot, and the second scales various elements of the figure so that it can be easily incorporated into different contexts.

The interface for manipulating these parameters consists of two pairs of functions. To control the style, use the axes_style() and set_style() functions. To scale the plot, use the plotting_context() and set_context() functions. In both cases, the first function returns a dictionary of parameters and the second sets the matplotlib defaults.
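
A short sketch of the "returns a dictionary" versus "sets the defaults" distinction follows. It also uses axes_style() in a with statement to apply a style temporarily; the specific rc keys inspected here (axes.grid, font.size) are just convenient ones to look at.

```python
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

whitegrid = sns.axes_style("whitegrid")       # a dict of rc settings; nothing is applied yet
poster = sns.plotting_context("poster")
notebook = sns.plotting_context("notebook")
print(whitegrid["axes.grid"], poster["font.size"], notebook["font.size"])

sns.set_style("darkgrid")                      # applies the settings globally
with sns.axes_style("white"):                  # applies them only inside the block
    grid_inside = mpl.rcParams["axes.grid"]
    fig, ax = plt.subplots()
    ax.plot([0, 1], [0, 1])
grid_after = mpl.rcParams["axes.grid"]         # back to the darkgrid setting
print(grid_inside, grid_after)
```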

Seaborn figure styles

There are five preset seaborn themes: darkgrid, whitegrid, dark, white, and ticks. They are each suited to different applications and personal preferences. The default theme is darkgrid. The grid helps the plot serve as a lookup table for quantitative information, and the white-on-gray helps to keep the grid from competing with lines that represent data. The whitegrid theme is similar, but it is better suited to plots with heavy data elements:

sns.set_style("whitegrid")
data = np.random.normal(size=(20, 6)) + np.arange(6) / 2
sns.boxplot(data=data);


Removing axes spines

Both the white and ticks styles can benefit from removing the top and right axes spines, which are not needed. You can call the seaborn function despine() to remove them:

sns.set_style("ticks")
sns.set_context("poster")
sinplot()
sns.despine()
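
For reference, the sketch below shows what despine() amounts to on a single Axes, next to the manual Matplotlib route. The two-panel layout and the plotted values are arbitrary choices for the comparison.

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, (ax_manual, ax_seaborn) = plt.subplots(1, 2, figsize=(8, 3))
for ax in (ax_manual, ax_seaborn):
    ax.plot([0, 1, 2], [0, 1, 4])

# Manual route: hide the two spines directly on the Axes.
for side in ("top", "right"):
    ax_manual.spines[side].set_visible(False)

# seaborn route: despine() hides the top and right spines by default.
sns.despine(ax=ax_seaborn)
```

Called without arguments, despine() acts on the current figure, which is why the bare call after sinplot() is enough.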


Choosing color palettes

current_palette = sns.color_palette()
sns.palplot(current_palette)