A problem I always run into when copying code verbatim: what should the original data look like?

A question that comes up often: the tutorials people share import data by directly calling a built-in dataset, such as data(dune). How do I read in my own data instead?

For beginners, this is a real problem. Preparing the data, getting it into the correct format, and feeding it into the subsequent code for analysis is the first obstacle in the process of learning and applying a tutorial.

Why do tutorials so often use built-in data?

  1. It is simple and convenient, portable and reproducible; this is the first advantage of built-in data.
  2. Built-in data has a clear pattern, so it usually gives nicer-looking results; this is the second advantage of built-in data.
  3. Others use it, so I use it too; a lazy habit.
  4. Everyone's idea of common sense is different. The author may find this step too simple and overlook the needs of beginners. (What do you learn in bioinformatics? Common sense!)

But this frequent use of built-in data is exactly what leads beginners to keep asking the question above when following tutorials.

I’m not in favor of using built-in data in tutorials because:

  1. It is not friendly to people who do not yet know how to read in data;
  2. It does not help you explore the problems you may encounter when applying the tutorial to real-world data. The sample data runs through without any thought, yet your own data may behave quite differently, for example showing no significant differences where the sample data shows clear ones.

If a tutorial does use built-in data, it should also provide some additional information:

  1. Describe in detail the format and biological meaning of the built-in data, and how it corresponds to real data; for an example, see the article on drawing a PCoA analysis result with statistical testing.
  2. Provide examples of the format of real data and code that reads in real data to bridge this “gap”;
    For example, the article "Did you use adonis correctly? The order of factors has a big impact on the results" was written precisely because the sample data showed significant differences while my own data showed none, which pushed me to understand the calculation from its principles and explore a solution.
  3. Mention how to resolve problems that may come up; this is also the part that can only be written after working through multiple sets of real data.

So if a tutorial does not provide such details and you still want to use it, what do you do?

How to prepare and read your own data based on the tutorial’s data

1. Review the structure of the data to understand how it is put together

Since the tutorial provides a test dataset, take a closer look at its characteristics and try to find a pattern.

Let’s take the dune dataset mentioned in the previous article as an example to see its structural characteristics.

The row names are numbers and the column names are strings (if we are not familiar with these strings, they mean nothing to us; we recognize every single character, but together we have no idea what the string is), and the values are integers. No other information is apparent.

library(vegan)  # the vegan package ships the dune dataset
data(dune)      # load the built-in dataset into the workspace
head(dune)      # look at the first few rows
##   Achimill Agrostol Airaprae Alopgeni Anthodor Bellpere Bromhord Chenalbu Cirsarve Comapalu Eleopalu Elymrepe Empenigr
## 1        1        0        0        0        0        0        0        0        0        0        0        4        0
## 2        3        0        0        2        0        3        4        0        0        0        0        4        0
## 3        0        4        0        7        0        2        0        0        0        0        0        4        0
## 4        0        8        0        2        0        2        3        0        2        0        0        4        0
## 5        2        0        0        0        4        2        2        0        0        0        0        4        0
## 6        2        0        0        0        3        0        0        0        0        0        0        0        0
##   Hyporadi Juncarti Juncbufo Lolipere Planlanc Poaprat Poatriv Ranuflam Rumeacet Sagiproc Salirepe Scorautu Trifprat Trifrepe
## 1        0        0        0        7        0       4       2        0        0        0        0        0        0        0
## 2        0        0        0        5        0       4       7        0        0        0        0        5        0        5
## 3        0        0        0        6        0       5       6        0        0        0        0        2        0        2
## 4        0        0        0        5        0       4       5        0        0        5        0        2        0        1
## 5        0        0        0        2        5       2       6        0        5        0        0        3        2        2
## 6        0        0        0        6        5       3       4        0        6        0        0        3        5        5
##   Vicilath Bracruta Callcusp
## 1        0        0        0
## 2        0        0        0
## 3        0        2        0
## 4        0        2        0
## 5        0        2        0
## 6        0        6        0
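
Besides head(), a couple of extra calls (added here as a quick check, not part of the original walkthrough) make the same observations explicit:

rownames(dune)  # sample identifiers; here they are just numbers
colnames(dune)  # abbreviated species names
str(dune)       # the type of each column and a preview of its values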

2. Check the help page for the data

If you cannot find useful information in the data structure and column names, check the help page.

?dune

dune is a data frame of observations of 30 species at 20 sites. The
species names are abbreviated to 4+4 letters (see make.cepnames).

What does this tell us? This dataset contains abundance information for 30 species in 20 samples. As dim(dune) shows, it is a matrix of 20 rows × 30 columns, so we can infer that each row is a sample and each column is a species
(another corroboration is that the column names are around 8 characters long, consistent with the 4+4 abbreviation of the species names).

Note: if you are still unsure about the data, it is worth Googling the dataset. Common built-in datasets have articles describing them that can support your judgment.

dim(dune)
## [1] 20 30
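
As a further check (my addition, not from the original text), counting the characters in the column names corroborates the 4+4 abbreviation:

nchar(colnames(dune))  # mostly 8 characters; genera shorter than 4 letters, such as Poa, give shorter names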

This format is slightly different from our usual OTU abundance table
(our tables usually have one species per row and one sample per column).

3. After the basic judgment, read in our own data and make any needed transformations

If we had an OTU abundance table, how would we read it in and convert it to this format?

# a small OTU abundance table written as tab-separated text for demonstration
text <- "ID\tSamp1\tSamp2\tSamp3\tSamp4
OTU1\t2\t13\t14\t15
OTU2\t12\t13\t8\t10
OTU3\t22\t10\t14\t11"
otu_table <- read.table(text=text, sep="\t", row.names=1, header=T)

Read in the OTU abundance table from a file in the same way; the first row holds the column names and the first column holds the row names.

otu_table <- read.table("otutable_rare", sep="\t", row.names=1, header=T)

Transposing it, based on the analysis above, gives input data that can be used for the subsequent analysis.

otu_table_t <- as.data.frame(t(otu_table))
otu_table_t
##       OTU1 OTU2 OTU3
## Samp1    2   12   22
## Samp2   13   13   10
## Samp3   14    8   14
## Samp4   15   10   11
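
If the downstream steps use vegan, one way to confirm the orientation is right (a sketch I am adding here, assuming the tutorial's next step is a Bray-Curtis distance or similar) is to run the same call on both tables:

library(vegan)
dist_dune <- vegdist(dune)         # the tutorial's built-in data: samples as rows
dist_own  <- vegdist(otu_table_t)  # our transposed table now has the same orientation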

4. What do the integers in the sample data mean?

This is the harder part to determine. There are only two ways to judge: 1)
the author mentions it in the tutorial (this is the most reliable way); 2) guess from experience.

This leads to another frequently asked question:

Do I need to provide raw data at this step, or normalized data?

In the vast majority of cases, what we need to provide is normalized data that is comparable across samples, because: 1) our goal is to compare differences between samples, so the data must be comparable between samples; 2) most tools do not normalize the data themselves; they either use it directly or apply transformations that do not change the relative relationships among values; 3) if a tool normalizes the data internally, it will definitely say so in its help, as DESeq2, edgeR, and limma do. Apart from these two and a half (limma counts as half because it can also accept normalized data), I cannot recall other tools that take raw counts. The exception is the single-cell package Seurat, which internally calls some normalization routines that can be turned off with parameters.
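
As an illustration only (these calls are my addition; which normalization is appropriate depends on your data and the downstream tool), a count table like the one above can be converted to per-sample relative abundances with vegan's decostand:

otu_rel <- decostand(otu_table_t, method="total")  # divide each value by its sample (row) total
rowSums(otu_rel)  # every sample now sums to 1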

5. Read more tutorials; sooner or later you will come across one that describes the required data structure in detail.

6. Trust your instincts: just read the data in and try it regardless, then adjust when you hit a warning or an error. Learning to program is not like running wet-lab experiments; the cost of trial and error is low. Watching without practicing gets you nowhere, and bold experimentation is king.
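
For instance, when a read fails or the table comes in looking wrong, a few read.table arguments are the usual things to adjust (the file name is the same placeholder used above):

otu_table <- read.table("otutable_rare", sep="\t", header=T, row.names=1,
                        comment.char="",   # keep header lines starting with '#' (common in OTU tables)
                        quote="",          # do not treat quote characters specially
                        check.names=FALSE) # keep sample names exactly as written
str(otu_table)  # check dimensions and column types before going further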

7. The final step is to communicate with the tutorial author. Questions about our tutorials are welcome in the discussion board at http://www.ehbio.com/Esx; after doing some work on your own, coming to the discussion with concrete questions and ideas makes it much easier to get answers.
