认识数据分析

Visualization often plays a minimal role in the data science and model-building process, yet Tukey, the creator of Exploratory Data Analysis, specifically advocated for the heavy use of visualization to address the limitations of numerical indicators.

可视化通常在数据科学和模型构建过程中起着最小的作用,但是“探索性数据分析”的创建者Tukey特别提倡大量使用可视化来解决数字指标的局限性。

Everyone’s heard — and understands — a picture equals a thousand words, and following this logic, a visualization of the data is worth at least as much as dozens of statistical metrics, from quartiles to means to standard deviations to mean absolute errors to kurtosis to entropy. Wherever there is an abundance of data, it is best understood when it is visualized.

每个人都能听到并理解,一幅图片等于一千个单词,按照这种逻辑,数据的可视化至少值几十种统计指标,从四分位数到均值到标准差再到绝对误差,峰度到熵。 无论何时何地都有大量数据,最好以可视化方式理解。

Exploratory Data Analysis was created to investigate the data, emphasizing visualization because it was more informative. This short article will present one of the most useful tools in visual EDA and how to interpret it.

创建了探索性数据分析来研究数据,并强调可视化,因为它更具信息性。 这篇简短的文章将介绍可视化EDA中最有用的工具之一,以及如何解释它。

Seaborn’s pairplot is magical: at its most simple, it gives us a rich and informational visual representation of univariate and bivariate relationships within the data. For instance, consider two pairplots below, created with one line of code, sns.pairplot(data) (the second adding hue=’species’ as a parameter).

Seaborn的pairplot是不可思议的:最简单的说,它为我们提供了数据中单变量和双变量关系的丰富且信息化的视觉表示。 例如,考虑下面的两个pairplot ,它们由一行代码sns.pairplot(data) (第二个将hue='species'作为参数添加)。

Image for post

There’s so much information to be gleaned about the data, be it the success of classification (how much entropy/overlap is there between classes), potential results of a feature selection process, variance, and what the best choice of model may be, based on these observed attributes. The pairplot is like an unfolding of multidimensional space.

有关数据的信息太多,包括分类是否成功(类别之间存在多少熵/重叠),特征选择过程的潜在结果,方差以及最佳模型选择,这些观察到的属性。 对图就像多维空间的展开。

Usually, people stop at the one-liner pairplot, but with a few more lines or even words of code, we can reap even more information and insights.

通常,人们会停留在单线对图上,但是只要再增加几行甚至是代码的话,我们就可以获取更多的信息和见解。

For one, pairplots can get notoriously large. To select a subset of the variables to be displayed, use the vars parameter, which can be set to a list of variable names. For instance, sns.pairplot(data,vars=[‘a’,’b’]) would only give the relationships between the two columns ‘a’ and ‘b’, being aa, ab, ba, and bb. Alternatively, one can specify x_vars and y_vars (each lists) to be the variables for each of those axes.

首先,成对的图可以变得很大。 要选择要显示的变量的子集,请使用vars参数,可以将其设置为变量名列表。 例如, sns.pairplot(data,vars=['a','b'])仅给出两列'a''b'之间的关系,即aaabbabb 。 或者,可以将x_varsy_vars (每个列表)指定为每个轴的变量。

The result of setting the first two plots (setting the vars parameter) is a symmetrical grid of plots:

设置前两个图(设置vars参数)的结果是一个对称的图网格:

Image for post

The third plot sets the y-component to only one variable — ‘sepal_length’ — and the x-component to all the columns of the data. This returns the interactions between that one column and all other columns. Note that for the first column — when it is paired against itself — and the fifth column — where it is paired against a categorical variable, the scatterplot is not an appropriate plot. We’ll explore how to deal with this later.

第三'sepal_length'图将y分量设置为仅一个变量'sepal_length' ,并将x分量设置为数据的所有列。 这将返回该一列与所有其他列之间的交互。 请注意,对于第一列(与它自身配对)和第五列(与类别变量配对),散点图不是合适的图。 稍后我们将探讨如何处理。

Image for post

By adding a kind=’reg’ keyword into your pairplot, you can get linear regression fits for the data. This is a great gage as to the linearity and variance of your data, which can lead to decisions about which types of models, both supervised and unsupervised, to choose. Additionally, since pairplots are symmetrical, to a) declutter the plot and b) reduce long loading times, setting corner=True removes the upper-right half, which is a duplicate.

通过在您的对图中添加kind='reg'关键字,您可以获得数据的线性回归拟合。 对于数据的线性和方差,这是一个很好的衡量标准,它可以决定要选择哪种类型的模型,包括监督模型和非监督模型。 此外,由于成对图是对称的,因此要a)整理曲线图和b)减少较长的加载时间,设置corner=True将删除右上半部分,这是重复项。

Image for post
Regression plot — left, corner plot — right
回归图-左图,角图-右图

The pairplot alone, however, is relatively limited in its ability to easily and intuitively display several relationships between variables. It is merely an interface to access the pairgrid, which is the real generator behind the ‘pairplot’. Properly handling visualization through pairgrid can yield valuable results.

然而, pairplot在其容易且直观地显示变量之间的几种关系的能力方面相对有限。 它仅仅是访问pairgrid的接口, pairgrid是“ pairplot ”背后的真正生成器。 通过pairgrid正确处理可视化pairgrid会产生有价值的结果。

Grids in seaborn are initialized to a variable, most commonly g (for grid).For instance, we may write g=sns.PairGrid(data). When grids are initialized, they are completely empty, but they will be filled in with visualizations soon. The grid is a method to access and visualize cross-feature aspects of the data in an efficient and clean way.

seaborn中的网格被初始化为一个变量,最常见的是g (对于网格)。例如,我们可以写g=sns.PairGrid(data) 。 初始化网格后,它们将完全为空,但是很快将被可视化填充。 网格是一种以有效且干净的方式访问和可视化数据的跨功能方面的方法。

Image for post

We can use map methods to fill in the grid with data. For instance, calling g.map(sns.scatterplot) fills the grid with scatterplots. We can also pass in the model’s parameters: in g.map(sns.kdeplot,shade=True), shade is a parameter of sns.kdeplot but it can be specified in the mapping. Since this is a grid, all the data is sorted out; we only need to call the type of plot.

我们可以使用地图方法用数据填充网格。 例如,调用g.map(sns.scatterplot)用散点图填充网格。 我们还可以传入模型的参数:在g.map(sns.kdeplot,shade=True) ,shade是sns.kdeplot的参数,但可以在映射中指定。 由于这是一个网格,因此将所有数据整理出来; 我们只需要调用情节类型即可。

Image for post

Note that the diagonals are still scatterplots. We can change this by using g.map_offdiag(sns.scatterplot) for plots not on the diagonal and g.map_diag(plt.hist) for plots on the diagonal. Note that we are able to use plotting objects from other libraries.

请注意,对角线仍然是散点图。 我们可以通过改变这个g.map_offdiag(sns.scatterplot)未对角和情节g.map_diag(plt.hist)的对角线上的地块。 注意,我们能够使用其他库中的绘图对象。

Image for post

We can do one better. Since the top and bottom halves are identical, we can change the plot type between the top and bottom halves using g.map_upper and g.map_lower. In this example, we compare the fits of quadratic and linear regression on the same data by varying the order parameter in seaborn’s regression plot, regplot.

我们可以做得更好。 由于上半部分和下半部分相同,因此我们可以使用g.map_upperg.map_lower在上半部分和下半部分之间更改绘图类型。 在此示例中,我们通过更改seaborn回归图regplot中的order参数,比较了二次回归和线性回归在同一数据上的拟合regplot

Image for post

To specify a hue, we can add the hue=’species’ parameter into the initialization of the PairGrid. Note that we cannot do something like g.map(sns.scatterplot, hue=’species’) because mapping is simply a visualization of the data, not a reprocessing of it. All the data is processed in the initialization of the grid, so all things data-related must be processed then.

要指定色调,我们可以将hue='species'参数添加到PairGrid的初始化中。 请注意,我们无法执行g.map(sns.scatterplot, hue='species')因为映射只是数据的可视化,而不是数据的重新处理。 所有数据都在网格的初始化中处理,因此所有与数据相关的事物都必须进行处理。

Image for post

Pairgrids are often used to build complex plots, but for the purposes of EDA, the operations covered should be enough.

Pairgrids通常用于构建复杂的地块,但就EDA而言,所涉及的操作应足够。

With a few more lines of code, you’ve been able to maximize the information gained from the pairplot and pairgrids. Here are some tips to take away as much insight as you can from it.

再多几行代码,您就可以最大化从pairplot和pairgrids获得的信息。 这里有一些技巧,您可以从中获得尽可能多的见识。

  • Look for curvatures and transformations (e.g. Tukey’s ladder of powers) that can be used to improve model performance.

    寻找可用于改善模型性能的曲率和变换(例如Tukey的幂阶)。
  • Approach features by how well they work in their entire row or column. For example, petal_width and petal_length perform well in separating classes along their designated axis very well across all other features. The same cannot be said for sepal_width, where there is much overlap along their axis. This means that it provides less information, can may be good cause for us to run a feature importance and remove it if it provides a negligible boost in predictive power.

    通过功能在整个行或整个列中的性能来评估功能。 例如,在所有其他petal_width ,沿着它们的指定轴分隔类时, petal_widthpetal_length性能很好。 sepal_width不能说sepal_width ,因为它们的轴上有很多重叠。 这意味着它提供的信息较少,如果它对预测能力的提升可忽略不计,则可能是促使我们发挥功能重要性并予以删除的良好原因。

  • Find how much data points vary from a regression fit (you can try different degrees as well) to get a visual understanding of how stable/stationary the data is. If data points vary widely from the fit and/or a fit must have a high degree to fit the data well, using methods like standardization or normalization may be helpful.

    查找与回归拟合有多少不同的数据点(您也可以尝试不同的程度),以直观了解数据的稳定性/平稳性。 如果数据点与拟合值相差很大,并且/或者拟合度必须高度匹配才能很好地拟合数据,则使用标准化或归一化等方法可能会有所帮助。
  • Spend a decent amount of time looking at visual bivariate representations of your data, playing around with comparisons and chart types. There are countless operations you can do to your data, and the purpose of EDA is not to give you answers but to spike your interest in taking a particular action. Data is different every time; no standard procedure fits all sizes.

    花大量的时间查看数据的可视双变量表示形式,进行比较和图表类型。 您可以对数据执行无数操作,而EDA的目的不是给您答案,而是激发您对采取特定行动的兴趣。 每次数据都不一样; 没有适合所有尺寸的标准程序。

翻译自: https://towardsdatascience.com/meet-your-new-best-exploratory-data-analysis-friend-772a60864227

认识数据分析

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。
如若转载,请注明出处:http://www.pswp.cn/news/388683.shtml
繁体地址,请注明出处:http://hk.pswp.cn/news/388683.shtml
英文地址,请注明出处:http://en.pswp.cn/news/388683.shtml

如若内容造成侵权/违法违规/事实不符,请联系英文站点网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

架构探险笔记10-框架优化之文件上传

确定文件上传使用场景 通常情况下,我们可以通过一个form(表单)来上传文件,就以下面的“创建客户”为例来说明(对应的文件名是customer_create.jsp),需要提供一个form,并将其enctype属…

matlab飞行数据仿真,基于MATLAB的飞行仿真

收稿日期: 2005 - 05 - 15   第 23卷  第 06期 计  算  机  仿  真 2006年 06月    文章编号: 1006 - 9348 (2006) 06 - 0057 - 05 基于 MATLAB的飞行仿真 张镭 ,姜洪洲 ,齐潘国 ,李洪人 (哈尔滨工业大学电液伺服仿真及试验系统研究所 ,黑龙江 哈尔滨 150001) 摘要:该…

Windows Server 2003 DNS服务安装篇

导读-- DNS(Domain Name System,域名系统)是一种组织成层次结构的分布式数据库,里面包含有从DNS域名到各种数据类型(如IP地址)的映射“贵有恒,何必三更起五更勤;最无益,只怕一日曝十日寒。”前一段时间巴哥因为一些生活琐事而中止…

正则表达式matlab,正则表达式中一个word的匹配 @MATLAB - 优秀的Free OS(Linux)版 - 北大未名BBS...

我目前想做的就是判断一个str是否可以被认为是有效的MATLAB index。最好的方法是直接运行,然后看运行结果或报错类型,但是我不打算在不知道是什么类型的东西之前运行它,所以可以预先parse一下,简单判断是否“长得跟有效的MATLAB i…

arima模型怎么拟合_7个统计测试,用于验证和帮助拟合ARIMA模型

arima模型怎么拟合什么是ARIMA? (What is ARIMA?) ARIMA models are one of the most classic and most widely used statistical forecasting techniques when dealing with univariate time series. It basically uses the lag values and lagged forecast error…

jQuery禁止Ajax请求缓存

一 现象 get请求在有些浏览器中会缓存。浏览器不会发送请求,而是使用上次请求获取到的结果。 post请求不会缓存。每次都会发送请求。 二 解决 jQuery提供了禁止Ajax请求缓存的方法: $.ajax({type: "get",url: "http://www.baidu.com?_&…

python 实例

参考 http://developer.51cto.com/art/201804/570408.htm 转载于:https://www.cnblogs.com/artesian0526/p/9552510.html

[WPF]ListView点击列头排序功能实现

[WPF]ListView点击列头排序功能实现 这是一个非常常见的功能,要求也很简单,在Column Header上显示一个小三角表示表示现在是在哪个Header上的正序还是倒序就可以了。微软的MSDN也已经提供了实现方式。微软的方法中,是通过ColumnHeader Templ…

天池幸福感的数据处理_了解幸福感与数据(第1部分)

天池幸福感的数据处理In these exceptional times, the lockdown left many of us with a lot of time to think. Think about the past and the future. Think about our way of life and our achievements. But most importantly, think about what has been and would be ou…

标线markLine的用法

series: [{markLine: {itemStyle: {normal: { lineStyle: { type: solid, color:#000 },label: { show: true, position:left } }},data: [{name: 平均线,// 支持 average, min, maxtype: average},{name: Y 轴值为 100 的水平线,yAxis: 100},[{// 起点和终点的项会共用一个 na…

php pfm 改端口,罗马2ESF和PFM 修改建筑 军团 派系 兵种等等等很多东西的教程

本帖最后由 clueber 于 2013-10-5 12:30 编辑本人是个罗马死忠加修改党,恩,所以分享一下自己的修改心得修改工具为ESF1.0.7和PFM3.0.3首先是ESF修改。ESF可以用来改开局设定和存档,修改开局设定是startpos.esf文件,在存档在我这里…

红草绿叶

从小到大喜欢阴天,喜欢下雨,喜欢那种潮湿的感觉。却又丝毫容不得脚上有一丝的水汽,也极其讨厌穿凉鞋。小时候特别喜欢去山上玩,偷桃子柿子,一切一切都成了美好的回忆,长大了,那些事情就都不复存…

wpf listview 使用

单列&#xff1a; <ListView Grid.Column"1" Height"284" HorizontalAlignment"Left" Margin"64,73,0,0" Name"listView1" VerticalAlignment"Top" Width"310" > <ListView.Items…

php 获取当天到23 59,js 获取当天23点59分59秒 时间戳 (最简单的方法)

原生Ajax 和Jq Ajax前言:这次介绍的是利用ajax与后台进行数据交换的小例子,所以demo必须通过服务器来打开.服务器环境非常好搭建,从网上下载wamp或xampp,一步步安装就ok,然后再把写好的页面放在服务器中指定的 ...『TCP&sol;IP详解——卷一&#xff1a;协议』读书笔记——1…

詹森不等式_注意詹森差距

詹森不等式背景 (Background) In Kaggle’s M5 Forecasting — Accuracy competition, the square root transformation ruined many of my team’s forecasts and led to a selective patching effort in the eleventh hour. Although it turned out well, we were reminded t…

【转载】儒林外史人物——荀玫

写在前面&#xff1a;本博客内容为转载&#xff0c;原文URL&#xff1a;http://blog.sina.com.cn/s/blog_9132ac5b0101iukw.html 说完周进&#xff0c;本应顺着说范进&#xff0c;但我觉得荀玫他们村的事情过于喜感&#xff0c;想先说荀玫。 荀玫简直是儒林中的某类标杆人物&am…

WebM VP8 SDK Usage/关于WebM VP8 SDK的用法

WebM是Google提出的新的网络视频格式&#xff0c;本质上是个MKV的壳&#xff0c;封装VPX中的VP8视频流与Vorbis OGG音频流。目前Firefox、Opera、Chrome都能直接打开WebM视频文件而无需其他任何乱七八糟的插件。我个人倒是很喜欢WebM的OGG音频&#xff0c;虽然在低比特率下不如…

数据分析师 需求分析师_是什么让分析师出色?

数据分析师 需求分析师重点 (Top highlight)Before we dissect the nature of analytical excellence, let’s start with a quick summary of three common misconceptions about analytics from Part 1:在剖析卓越分析的本质之前&#xff0c;让我们从第1部分中对分析的三种常…

JQuery发起ajax请求,并在页面动态的添加元素

页面html代码&#xff1a; <li><div class"coll-tit"><span class"coll-icon"><iclass"sysfont coll-default"></i>全域旅游目的地</span></div><div class"coll-panel"><div c…