Reshaping data frames: tidyr vs reshape2

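Reshaping data frames is an important topic when learning R, and a key step in moving from an Excel mindset to a programming mindset. Many plotting functions can turn input data into a beautiful figure with a single command, but beginners often get stuck producing input data in the required shape. The root cause is that Excel and R store data in very different ways. Data recorded day to day in Excel is often in a format that cannot be used directly, and reshaping it in Excel by endless clicking and tweaking goes against the "lazy" spirit of learning R. So, to save that tedious clicking time, it is worth spending a little time learning how to reshape data.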

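The two packages and three pairs of functions covered here all come from Hadley Wickham. Most people who write a package are just building a tool; he defined a whole way of thinking about data transformation. Let's go straight to the tables and get a visual feel for "long" versus "wide", which saves a long block of description.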

test <- data.frame(geneid = paste0("gene",1:4),
                   sample1 = c(1,4,7,10),
                   sample2 = c(2,5,0.8,11),
                   sample3 = c(0.3,6,9,12))
test

I won't call this test a "wide" table just yet, because once I make it "long" you will see for yourself how "wide" it was.

# Don't worry about the package and function yet; just look at the table
library(tidyr)
gather(data = test,
       key = sample_nm,
       value = exp,
       -geneid)

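Going from "wide" to "long" gathers all observed values into a single column and all the variable names into another column. R data processing mostly works on whole columns (vectors), and which elements of a column you use is decided by filtering. That is why we want this shape.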
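To make the "take a column, then filter" idea concrete, here is a minimal sketch on the same test data. The object names `test_long` and `high_in_sample2` and the threshold of 3 are made up for illustration.

```r
library(tidyr)

test <- data.frame(geneid = paste0("gene", 1:4),
                   sample1 = c(1, 4, 7, 10),
                   sample2 = c(2, 5, 0.8, 11),
                   sample3 = c(0.3, 6, 9, 12))

# long form: one row per gene/sample pair
test_long <- gather(test, key = "sample_nm", value = "exp", -geneid)

# every observation lives in the exp column, so one logical filter is enough
high_in_sample2 <- test_long[test_long$sample_nm == "sample2" &
                               test_long$exp > 3, ]
high_in_sample2$geneid

# per-sample summaries also work directly on the single value column
tapply(test_long$exp, test_long$sample_nm, mean)
```

In the wide table the same question would need a different column name for every sample; in the long table the sample is just another value to filter on.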

reshape2: melt & dcast

My guess is that this package came first, since it has the fewest functions. Its documentation contains this sentence:

Reshape (hopefully) makes it easy to do what you have been struggling to do with tapply, by, aggregate, xtabs, apply and summarise. It is also useful for getting your data into the correct structure for lattice or ggplot plots.

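It feels like it was born for ggplot. The two together won the 2006 John Chambers Award for Statistical Computing. Although everyone uses ggplot2 now, these older packages and functions still work very well and are very much alive. The three pairs of functions do not differ greatly, and the subtle differences rarely come up, so pick whichever feels most natural.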

melt: wide to long

The basic syntax of melt is:

melt(data, id.vars, measure.vars, variable.name = "variable", ..., na.rm = FALSE, value.name = "value", factorsAsStrings = TRUE)

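You only need to remember the first three arguments; the rest can usually be left alone:

data: the data frame you want to process
id.vars: the variables you do not want to melt; one or several (wrapped in a vector)
measure.vars: the variables you want to melt
variable.name & value.name: by default the column that stores the melted variable names is called "variable" and the column of observed values is called "value"; these two arguments let you choose your own names at melt time.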

library(reshape2)
test_melt <- melt(test,id.vars = "geneid",measure.vars = c("sample1","sample2","sample3"))
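The call above keeps the default column names. As a small sketch of the variable.name and value.name arguments described earlier, the following names the new columns at melt time. The object name `test_melt_named` is made up for illustration, and measure.vars is left out on the assumption that every non-id column should be melted.

```r
library(reshape2)

test <- data.frame(geneid = paste0("gene", 1:4),
                   sample1 = c(1, 4, 7, 10),
                   sample2 = c(2, 5, 0.8, 11),
                   sample3 = c(0.3, 6, 9, 12))

# when measure.vars is omitted, melt uses all columns not listed in id.vars
test_melt_named <- melt(test,
                        id.vars = "geneid",
                        variable.name = "sample_nm",
                        value.name = "exp")
colnames(test_melt_named)
```

Naming the columns here saves a separate colnames() assignment later.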

dcast: long to wide

Going from "long" to "wide" uses the dcast() function. Its syntax is:

dcast(data, formula, fun.aggregate = NULL, ..., margins = NULL,
  subset = NULL, fill = NULL, drop = TRUE, value.var = guess_value(data))

The part to master is the format of the formula argument:

rowvar1 + rowvar2 +...  ~  colvar1 + colvar2 +...

Keep rowvar as rows and expand colvar into columns. In other words, rowvar stays fixed and colvar gets split out.

Let's try restoring the melted data and compare it with the original. They are the same.

test_dcast <- dcast(test_melt, geneid ~ variable)
test_dcast
identical(test, test_dcast)
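The formula can also carry more than one variable on the row side. The following is only a sketch on a small made-up data frame: `long`, `wide`, and the columns `type`, `sample` and `exp` are hypothetical and not part of the test data above.

```r
library(reshape2)

# hypothetical long table: two genes, two types, two samples
long <- data.frame(geneid = rep(c("gene1", "gene2"), each = 4),
                   type   = rep(rep(c("A", "B"), each = 2), times = 2),
                   sample = rep(c("sample1", "sample2"), times = 4),
                   exp    = 1:8)

# geneid and type both stay as row identifiers; sample is spread into columns
wide <- dcast(long, geneid + type ~ sample, value.var = "exp")
wide
```

Each distinct geneid/type combination becomes one row, and each value of sample becomes one column. Passing value.var explicitly avoids relying on dcast guessing which column holds the values.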

A great feature of dcast: aggregation

A great feature of dcast is that it can apply a function to the data while it reshapes, which dcast calls aggregation.

fun.aggregate: the aggregation function to apply when several values fall into the same cell of the result

Since test has no repeated groups, we add a type column for the demonstration.

x = test
x$type = rep(c("A","B"),2)
x

In the result of the following code, each sample column has been averaged (mean) by type.

x_melt = melt(x,id.vars = c("geneid","type"),measure.vars = c("sample1","sample2","sample3"))
dcast(x_melt, type ~ variable, mean)
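mean is only one choice of aggregation function. As a sketch, the same reshaping with sum, and with no function at all, looks like this. The object names `x_sum` and `x_n` are made up for illustration.

```r
library(reshape2)

test <- data.frame(geneid = paste0("gene", 1:4),
                   sample1 = c(1, 4, 7, 10),
                   sample2 = c(2, 5, 0.8, 11),
                   sample3 = c(0.3, 6, 9, 12))
x <- test
x$type <- rep(c("A", "B"), 2)
x_melt <- melt(x, id.vars = c("geneid", "type"))

# any function that reduces a vector to one value can be used
x_sum <- dcast(x_melt, type ~ variable, fun.aggregate = sum)
x_sum

# with several values per cell and no function, dcast prints a message
# and falls back to length, i.e. it counts the values in each cell
x_n <- dcast(x_melt, type ~ variable)
x_n
```

The fallback to length is easy to miss, so a result full of small integers is a hint that fun.aggregate was needed.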

tidyr: gather & spread

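tidyr is a member of the tidyverse, Hadley Wickham's masterpiece. It focuses on tidying data frames and has many functions besides gather and spread, but only these two are covered here.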

gather: wide to long

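This pair of functions introduces the concepts of key and value. key is the name of the new column that holds the former column names, and corresponds to variable.name in melt. value is the name of the new column that holds the observed values, and corresponds to value.name. The columns to gather are selected after these two arguments; to leave a column unchanged (the role id.vars plays in melt), put a "-" (minus sign) in front of its name.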

Usage:

gather(data,key = "key", value = "value", ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)

test_gather <- gather(data = test,
                      key = sample_nm,
                      value = exp,
                      - geneid)
test_gather
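Excluding columns with a minus sign is one way to select what gets gathered. As a sketch, the columns can also be named positively, and factor_key from the signature above controls whether the key column is stored as a factor. The object names `by_exclusion`, `by_range` and `by_factor` are made up for illustration.

```r
library(tidyr)

test <- data.frame(geneid = paste0("gene", 1:4),
                   sample1 = c(1, 4, 7, 10),
                   sample2 = c(2, 5, 0.8, 11),
                   sample3 = c(0.3, 6, 9, 12))

# gather everything except geneid
by_exclusion <- gather(test, key = "sample_nm", value = "exp", -geneid)

# gather an explicit range of columns instead
by_range <- gather(test, key = "sample_nm", value = "exp", sample1:sample3)
identical(by_exclusion, by_range)

# factor_key = TRUE stores the key as a factor, levels in original column order
by_factor <- gather(test, key = "sample_nm", value = "exp", -geneid,
                    factor_key = TRUE)
levels(by_factor$sample_nm)
```

Exclusion is convenient when there are many sample columns and only a few identifier columns.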

spread: long to wide

Usage:

spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)

Running it on the test data and comparing with the original, the result is again the same.

test_spread <- spread(data = test_gather,
                  key = sample_nm,
                  value = exp)
test_spread
identical(test,test_spread)

Note the fill argument of spread, which handles missing values; by default they are filled with NA.

test_na <- test_gather[c(1:2,5:11),]
test_na_re <- spread(data = test_na,
                     key = sample_nm,
                     value = exp)
test_na_re # missing cells are NA
# other fill values also work, such as a string or a fixed number
test_na_fill <- spread(data = test_na,
                     key = sample_nm,
                     value = exp,
                     fill = "AAA")

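The drop argument is not quite the same as na.rm in gather. na.rm = TRUE in gather removes rows whose value is NA, whereas drop = FALSE in spread keeps factor levels of the key that do not appear in the data and fills the resulting missing combinations with fill.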
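As a rough illustration of what drop controls, here is a sketch on a small made-up table whose key is a factor with one unused level. The names `long`, `dropped` and `kept` are hypothetical.

```r
library(tidyr)

# sample3 is a declared level of the key but never occurs in the data
long <- data.frame(geneid = c("gene1", "gene2"),
                   sample_nm = factor(c("sample1", "sample2"),
                                      levels = c("sample1", "sample2", "sample3")),
                   exp = c(1, 5))

# default drop = TRUE: only levels that occur become columns
dropped <- spread(long, key = sample_nm, value = exp)
names(dropped)

# drop = FALSE: the unused level also gets a column, filled with NA
kept <- spread(long, key = sample_nm, value = exp, drop = FALSE)
names(kept)
```

This only matters when the key column is a factor; with a character key there are no unused levels to keep.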

Comparing `melt` and `gather`

Let's compare the results of melt and gather.

colnames(test_melt) <- colnames(test_gather) # make the column names match first
identical(test_gather,test_melt)

The results are not identical, which may be a surprise. By eye you cannot tell them apart.

test_gather
test_melt

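Running str on both shows it: the column types differ. gather stores the gathered variable names as character, whereas melt stores them as a factor (even with options(stringsAsFactors = FALSE) set globally). This affects plotting, so personally I find gather nicer to use: when you do need a factor, convert it by hand, which also lets you set the level order yourself.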
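The type difference and the manual conversion can be sketched as follows. The level order chosen here is arbitrary and only for illustration.

```r
library(reshape2)
library(tidyr)

test <- data.frame(geneid = paste0("gene", 1:4),
                   sample1 = c(1, 4, 7, 10),
                   sample2 = c(2, 5, 0.8, 11),
                   sample3 = c(0.3, 6, 9, 12))

test_melt <- melt(test, id.vars = "geneid")
test_gather <- gather(test, key = "sample_nm", value = "exp", -geneid)

class(test_melt$variable)     # "factor"
class(test_gather$sample_nm)  # "character"

# convert by hand when a factor is needed, choosing the level order
test_gather$sample_nm <- factor(test_gather$sample_nm,
                                levels = c("sample3", "sample1", "sample2"))
levels(test_gather$sample_nm)
```

In a plot, the level order is what decides the order of the samples on a discrete axis, so setting it explicitly is often what you want anyway.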

tidyr: pivot_longer & pivot_wider

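The help page for gather shows that the function's lifecycle status is "retired". That does not affect our fondness for it or our use of it, but curiosity led me to the pivot functions recommended at the bottom of the documentation, where I found the author's own official criticism of his design: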

For some time, it’s been obvious that there is something fundamentally wrong with the design of spread() and gather(). Many people don’t find the names intuitive and find it hard to remember which direction corresponds to spreading and which to gathering. It also seems surprisingly hard to remember the arguments to these functions, meaning that many people (including me!) have to consult the documentation every time.

Their usage is shown below with only the important arguments kept. See vignette("pivot") for details.

pivot_longer(
  data,
  cols,
  names_to = "name",
  values_to = "value")

pivot_wider(
  data,
  id_cols = NULL,
  names_from = name,
  values_from = value)

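This pair of functions is not fundamentally different from gather & spread; the arguments simply have plainer names, which makes them friendlier to use. The corresponding argument names are: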

gather & spread	pivot_X
gather: key	pivot_longer: names_to
gather: value	pivot_longer: values_to
spread: key	pivot_wider: names_from
spread: value	pivot_wider: values_from
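For completeness, here is a minimal sketch of the same round trip on the test data with the pivot functions. The object names `test_long` and `test_wide` are made up for illustration.

```r
library(tidyr)

test <- data.frame(geneid = paste0("gene", 1:4),
                   sample1 = c(1, 4, 7, 10),
                   sample2 = c(2, 5, 0.8, 11),
                   sample3 = c(0.3, 6, 9, 12))

# wide to long: names_to / values_to play the roles of key / value in gather
test_long <- pivot_longer(test,
                          cols = -geneid,
                          names_to = "sample_nm",
                          values_to = "exp")
test_long

# long to wide: names_from / values_from play the roles of key / value in spread
test_wide <- pivot_wider(test_long,
                         names_from = sample_nm,
                         values_from = exp)
test_wide
```

Two differences from gather & spread are worth knowing: the results are tibbles rather than plain data frames, and pivot_longer orders the rows gene by gene, whereas gather orders them sample by sample.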

Further reading

I also came across an article comparing tidyr and reshape2: https://rpubs.com/sterding/Reshape2_and_Tidyr

Tidyr vs. Reshape2

They are very similar!
cast() in reshape2 can work on matrix/array, while gather() in tidyr can only work on data.frame.
reshape2 can do aggregation, while tidyr is not designed for this purpose.
colsplit() in reshape2 operates only on a single column while separate() in tidyr performs all the operation at once.
