3.2: Variables and Data - Biology

Like most languages, R lets us assign data to variables. In fact, we can do so using either the = assignment operator or the <- operator, though the latter is most commonly found and generally preferred.
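The accompanying code listing did not survive in this copy; the following minimal sketch is consistent with the discussion below (the value -4.4 is chosen so the output matches the [1] 4.4 described later):

    alpha <- -4.4            # assign -4.4 to alpha with the preferred <- operator
    alpha_abs = abs(alpha)   # = also assigns, though <- is generally preferred
    print(alpha_abs)         # prints: [1] 4.4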

Here, print() is a function, which prints the contents of its parameter (to the interpreter window in RStudio, or standard output on the command line). This function has the "side effect" of printing the output but doesn't return anything.[1] By contrast, the abs() function returns the absolute value of its input without any other effects.

The interpreter ignores # characters and anything after them on a single line, so we can use them to insert comments in our code for explanation or to improve readability. Blank lines are ignored, so we can add them to improve readability as well.

You might be curious why the extra [1] is included in the printed output; we'll return to that point soon, but for now, let it suffice to say that the number 4.4 is the first (and only) of a collection of values being printed.

The right-hand side of an assignment is usually evaluated first, so we can do tricky things like reuse variable names in expressions.
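A quick illustration (variable name and values are my own):

    count <- 10
    count <- count + 5   # the right-hand side is evaluated first, so count is now 15
    print(count)         # prints: [1] 15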

Variable and function names in R deserve some special discussion. There are a variety of conventions, but a common one that we’ll use is the same convention we used for Python: variable names should (1) consist of only letters and numbers and underscores, (2) start with a lowercase letter, (3) use underscores to separate words, and (4) be meaningful and descriptive to make code more readable.

In R, variable and function names are also allowed to include the . character, which carries no special meaning (unlike in many other languages). So, alpha.abs <- abs(alpha) is not an uncommon thing to see, though we'll be sticking with the convention alpha_abs <- abs(alpha). R variable names may be almost anything, so long as we are willing to surround the name with backtick characters. So, `alpha abs` <- abs(alpha) would be a valid line of code, as would a following line like print(`alpha abs`), though this is not recommended.

Numerics, Integers, Characters, and Logicals

One of the most basic types of data in R is the "numeric," also known as a float, or floating-point number, in other languages.[2] R even supports scientific notation for these types.

R also provides a separate type for integers, numbers that don't have a fractional value. They are important, but less commonly seen in R, primarily because numbers are created as numerics by default, even if they look like integers.

It is possible to convert numeric types to actual integer types with the as.integer() function, and vice versa with the as.numeric() function.

When converting to an integer type, decimal parts are removed, and thus the values are rounded toward 0 (4.8 becomes 4, and -4.8 would become -4).
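A short sketch of these conversions (the values are illustrative):

    print(as.integer(4.8))    # decimal part dropped, rounded toward 0: [1] 4
    print(as.integer(-4.8))   # also rounded toward 0: [1] -4
    print(as.numeric(4L))     # integer back to numeric: [1] 4
    print(1.5e3)              # scientific notation for a numeric: [1] 1500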

The "character" data type holds a string of characters (though of course the string may contain only a single character, or no characters, as in ''). These can be specified using either single or double quotes.

Concatenating character strings is trickier in R than in some other languages, so we'll cover that in chapter 32, "Character and Categorical Data." (The cat() function works similarly, and allows us to include special characters like tabs and newlines by using \t and \n, respectively; cat("Shawn\tO'Neil") would output something like Shawn   O'Neil.)

Character types are different from integers and numerics, and they can't be treated like them even if they look like them. However, the as.character() and as.numeric() functions will convert character strings to the respective type if it is possible to do so.

By default, the R interpreter will produce a warning (NAs introduced by coercion) if such a conversion doesn't make sense, as in as.numeric("Shawn"). It is also possible to convert a numeric or integer type to a character type, using as.character().
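A minimal sketch of these conversions (values are my own):

    print(as.numeric("4.6"))     # character to numeric: [1] 4.6
    print(as.character(4.6))     # numeric to character: [1] "4.6"
    print(as.numeric("Shawn"))   # [1] NA, with a coercion warning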

The "logical" data type, known as a Boolean type in other languages, is one of the more important types for R. These simple types store either the special value TRUE or the special value FALSE (by default, these can also be represented by the shorthand T and F, though this shorthand is less preferred because some coders occasionally use T and F for variable names as well). Comparisons between other types return logical values (unless they result in a warning or error of some kind). It is possible to compare character types with comparators like < and >; the comparison is done in lexicographic (dictionary) order.

But beware: in R (and Python), such comparisons also work when they should perhaps instead result in an error: a character type can be validly compared to a numeric type, because the numeric is silently converted to a character string and the two are compared lexicographically. This particular property has resulted in a number of programming mistakes.

R supports <, >, <=, >=, ==, and != comparisons, and these have the same meaning as the comparisons in Python (see chapter 17, "Conditional Control Flow," for details). For numeric types, R suffers from the same caveat about equality comparison as Python and other languages: rounding errors for numbers with decimal expansions can compound in dangerous ways, and so comparing numerics for equality should be done with care. (You can see this by trying to run print(0.2 * 0.2 / 0.2 == 0.2), which will result in FALSE; again, see chapter 17 for details.[3]) The "official" way to compare two numerics for approximate equality in R is rather clunky: isTRUE(all.equal(a, b)) returns TRUE if a and b are approximately equal (or, if they contain multiple values, if all elements are). We'll explore some alternatives in later chapters.
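A short example of the equality caveat and the all.equal() idiom:

    a <- 0.2 * 0.2 / 0.2
    b <- 0.2
    print(a == b)                   # FALSE, due to floating-point rounding
    print(isTRUE(all.equal(a, b)))  # TRUE: approximately equal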

Speaking of programming mistakes, because <- is the preferred assignment operator but = is also an assignment operator, one must be careful when coding with these and the == or < comparison operators. Consider the following similar statements, all of which have different meanings.
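The statements themselves did not survive in this copy; a sketch of the kind of near-identical lines meant here:

    a <- 5    # assign 5 to a
    a < -5    # no assignment: compare a to -5 (FALSE)
    a <- -5   # assign -5 to a
    a == -5   # compare a to -5 (TRUE)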

R also supports logical connectives, though these take on a slightly different syntax than most other languages.

Connective   Meaning                                     Example (with a <- 7, b <- 3)
&            and: TRUE if both sides are TRUE            a < 8 & b == 3   # TRUE
|            or: TRUE if one or both sides are TRUE      a < 8 | b == 9   # TRUE
!            not: TRUE if the following is FALSE         ! a < 3          # TRUE

These can be grouped with parentheses, and usually should be to avoid confusion.

When combining logical expressions this way, each side of an ampersand or | must result in a logical—the code a == 9 | 7 is not the same as a == 9 | a == 7 (and, in fact, the former will always result in TRUE with no warning).
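A small illustration of the difference (values are mine):

    a <- 2
    print(a == 9 | a == 7)   # FALSE: neither comparison holds
    print(a == 9 | 7)        # TRUE with no warning: the nonzero 7 is treated as TRUE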

Because R is such a dynamic language, it can often be useful to check what type of data a particular variable is referring to. This can be accomplished with the class() function, which returns a character string of the appropriate type.
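For example (output shown in comments):

    print(class(4.4))            # [1] "numeric"
    print(class(as.integer(4)))  # [1] "integer"
    print(class("hello"))        # [1] "character"
    print(class(TRUE))           # [1] "logical"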

We’ll do this frequently as we continue to learn about various R data types.

Exercises

  1. Given a set of variables, a, b, c, and d, find assignments of them to either TRUE or FALSE such that the result variable holds TRUE.
  2. Without running the code, try to reason out what print(class(class(4.5))) would result in.
  3. Try converting a character type like "1e-50" to a numeric type with as.numeric(), and one like "1x10^5". What are the numeric values after conversion? Try converting the numeric value 0.00000001 to a character type—what is the string produced? What are the smallest and largest numerics you can create?
  4. The is.numeric() function returns the logical TRUE if its input is a numeric type, and FALSE otherwise. The functions is.character(), is.integer(), and is.logical() do the same for their respective types. Try using these to test whether specific variables are specific types.
  5. What happens when you run a line like print("ABC" * 4)? What about print("ABC" + 4)? Why do you think the results are what they are? How about print("ABC" + "DEF")? Finally, try the following: print(TRUE + 5), print(TRUE + 7), print(FALSE + 5), print(FALSE + 7), print(TRUE * 4), and print(FALSE * 4). What do you think is happening here?


Understanding types of variables

Published on November 21, 2019 by Rebecca Bevans. Revised on March 2, 2021.

In statistical research, a variable is defined as an attribute of an object of study. Choosing which variables to measure is central to good experimental design.

Example

If you want to test whether some plant species are more salt-tolerant than others, some key variables you might measure include the amount of salt you add to the water, the species of plants being studied, and variables related to plant health like growth and wilting.

You need to know which types of variables you are working with in order to choose appropriate statistical tests and interpret the results of your study.

You can usually identify the type of variable by asking two questions:


Measurement variables are, as the name implies, things you can measure. An individual observation of a measurement variable is always a number. Examples include length, weight, pH, and bone density. Other names for them include "numeric" or "quantitative" variables.

Some authors divide measurement variables into two types. One type is continuous variables, such as the length of an isopod's antenna, which in theory have an infinite number of possible values. The other is discrete (or meristic) variables, which can take only whole-number values; these are things you count, such as the number of spines on an isopod's antenna. The mathematical theories underlying statistical tests involving measurement variables assume that the variables are continuous. Luckily, these statistical tests work well on discrete measurement variables, so you usually don't need to worry about the difference between continuous and discrete measurement variables. The only exception would be if you have a very small number of possible values of a discrete variable, in which case you might want to treat it as a nominal variable instead.

When you have a measurement variable with a small number of values, it may not be clear whether it should be considered a measurement or a nominal variable. For example, let's say your isopods have 20 to 55 spines on their left antenna, and you want to know whether the average number of spines on the left antenna is different between males and females. You should consider spine number to be a measurement variable and analyze the data using a two-sample t-test or a one-way anova. If there are only two different spine numbers—some isopods have 32 spines, and some have 33—you should treat spine number as a nominal variable, with the values "32" and "33," and compare the proportions of isopods with 32 or 33 spines in males and females using a Fisher's exact test of independence (or chi-square or G-test of independence, if your sample size is really big). The same is true for laboratory experiments: if you give your isopods food with 15 different mannose concentrations and then measure their growth rate, mannose concentration would be a measurement variable; if you give some isopods food with 5 mM mannose and the rest of the isopods get 25 mM mannose, then mannose concentration would be a nominal variable.

But what if you design an experiment with three concentrations of mannose, or five, or seven? There is no rigid rule, and how you treat the variable will depend in part on your null and alternative hypotheses. If your alternative hypothesis is "different values of mannose have different rates of isopod growth," you could treat mannose concentration as a nominal variable. Even if there's some weird pattern of high growth on zero mannose, low growth on small amounts, high growth on intermediate amounts, and low growth on high amounts of mannose, a one-way anova could give a significant result. If your alternative hypothesis is "isopods grow faster with more mannose," it would be better to treat mannose concentration as a measurement variable, so you can do a regression.

The following rule of thumb can be used:

  • a measurement variable with only two values should be treated as a nominal variable
  • a measurement variable with six or more values should be treated as a measurement variable
  • a measurement variable with three, four or five values does not exist

Of course, in the real world there are experiments with three, four or five values of a measurement variable. Simulation studies show that analyzing such dependent variables with the methods used for measurement variables works well (Fagerland et al. 2011). I am not aware of any research on the effect of treating independent variables with small numbers of values as measurement or nominal. Your decision about how to treat your variable will depend in part on your biological question. You may be able to avoid the ambiguity when you design the experiment: if you want to know whether a dependent variable is related to an independent variable that could be measurement, it's a good idea to have at least six values of the independent variable.

Something that could be measured is a measurement variable, even when you set the values. For example, if you grow isopods with one batch of food containing 10 mM mannose, another batch of food with 20 mM mannose, another batch with 30 mM mannose, etc., up to 100 mM mannose, the different mannose concentrations are a measurement variable, even though you made the food and set the mannose concentration yourself.

Be careful when you count something, as it is sometimes a nominal variable and sometimes a measurement variable. For example, the number of bacteria colonies on a plate is a measurement variable: you count the number of colonies, and there are 87 colonies on one plate, 92 on another plate, etc. Each plate would have one data point, the number of colonies; that's a number, so it's a measurement variable. However, if the plate has red and white bacteria colonies and you count the number of each, it is a nominal variable. Now, each colony is a separate data point with one of two values of the variable, "red" or "white"; because that's a word, not a number, it's a nominal variable. In this case, you might summarize the nominal data with a number (the percentage of colonies that are red), but the underlying data are still nominal.

Ratios

Sometimes you can simplify your statistical analysis by taking the ratio of two measurement variables. For example, if you want to know whether male isopods have bigger heads, relative to body size, than female isopods, you could take the ratio of head width to body length for each isopod, and compare the mean ratios of males and females using a two-sample t-test. However, this assumes that the ratio is the same for different body sizes. We know that's not true for humans (the head size/body size ratio in babies is freakishly large, compared to adults), so you should look at the regression of head width on body length and make sure the regression line goes pretty close to the origin, as a straight regression line through the origin means the ratios stay the same for different values of the X variable. If the regression line doesn't go near the origin, it would be better to keep the two variables separate instead of calculating a ratio, and compare the regression line of head width on body length in males to that in females using an analysis of covariance.

Circular variables

One special kind of measurement variable is a circular variable. These have the property that the highest value and the lowest value are right next to each other; often, the zero point is completely arbitrary. The most common circular variables in biology are time of day, time of year, and compass direction. If you measure time of year in days, Day 1 could be January 1, or the spring equinox, or your birthday; whichever day you pick, Day 1 is adjacent to Day 2 on one side and Day 365 on the other.

If you are only considering part of the circle, a circular variable becomes a regular measurement variable. For example, if you're doing a polynomial regression of bear attacks vs. time of the year in Yellowstone National Park, you could treat "month" as a measurement variable, with March as 1 and November as 9; you wouldn't have to worry that February (month 12) is next to March, because bears are hibernating in December through February, and you would ignore those three months.

However, if your variable really is circular, there are special, very obscure statistical tests designed just for circular data; chapters 26 and 27 in Zar (1999) are a good place to start.


3 Most Important Types of Biological Variables

Each biological discipline has its own set of variables, which may include conventional morphological measurements, concentrations of chemicals in body fluids, rates of certain biological processes, frequencies of certain events (as in genetics and radiation biology), and many more.


A variable can be defined as a property with respect to which individuals in a sample differ in some ascertainable way. If the property does not differ within a sample at hand, or at least among the samples being studied, it cannot be of statistical interest. Length, height, weight, number of teeth, vitamin C content, and genotype are examples of variables in ordinary, genetically and phenotypically diverse groups of organisms.

Warm-bloodedness in a group of mammals is not a variable, since they are all alike in this regard, although the body temperature of individual mammals would, of course, be a variable.

Types of biological variables:

Biological variables have been classified into the following types:

1. Measurement variables:

Measurement variables are all those whose differing states can be expressed in a numerically ordered fashion. They are divisible into two kinds. The first of these are continuous variables, which at least theoretically can assume an infinite number of values between any two fixed points.

For example, between two length measurements 1.5 and 1.6 cm there is an infinite number of lengths that could be measured if one were so inclined and had a precise enough method of calibration to obtain such measurements.

Any given reading of a continuous variable, such as a length of 157 mm, is therefore an approximation to the exact reading, which in practice is unknowable. Some common examples of biological continuous variables are lengths, areas, volumes, weights, angles, temperatures, periods of time, percentages, and rates.

Contrasted with continuous variables are the discontinuous variables, also known as meristic or discrete variables. These are variables that have only certain fixed numerical values, with no intermediate values possible in between. Thus, the number of segments in a certain insect appendage may be 4 or 5 or 6, but never 5 1/2 or 4.3.

Examples of discontinuous variables are numbers of certain structures (such as segments, bristles, teeth or glands), the numbers of offspring, the numbers of colonies of micro-organisms or animals, or the numbers of plants in a given quadrat.

2. Ranked variables:

Some variables cannot be measured but can at least be ordered or ranked by their magnitude. Thus, in an experiment one might record the rank order of emergence of ten pupae without specifying the exact time at which each pupa emerged. In such cases the data are coded as a ranked variable, the order of emergence.

Thus, by expressing a variable as a series of ranks, such as 1, 2, 3, 4, 5, we do not imply that the difference in magnitude between, say, ranks 1 and 2 is identical to or even proportional to the difference between ranks 2 and 3.

3. Attributes:

Variables that cannot be measured but must be expressed qualitatively are called attributes. These are all properties such as black or white, pregnant or not pregnant, dead or alive, male or female. When such attributes are combined with frequencies, they can be treated statistically.

For example, of 80 mice, we may state that four were black, two agouti, and the rest gray. When attributes are combined with frequencies into tables suitable for statistical analysis, they are referred to as enumeration data. Thus, the enumeration data on colour in mice just mentioned would be arranged as follows:

Colour    Number of mice
Black     4
Agouti    2
Gray      74
Total     80


Statistical Data / Variables – Introduction (Classification of Statistical Data / Variable – Numeric vs Categorical)

Ø Data is a set of values of qualitative or quantitative variables.

Ø In biostatistics (as in statistics generally), data are the individual observations.

Ø Scientific investigations involve observations on variables.

Ø The observations made on these variables are obtained in the form of 'data'.

Ø A variable is a quantity or characteristic which can 'vary from one individual to another'.

Ø Example: Consider the characteristic 'weight' of individuals and let it be denoted by the letter 'N'. The value of 'N' varies from one individual to another, and thus 'N' is a variable.

Ø Data and variable are not exact synonyms, but they are frequently used as such.

Ø Variables can also be called 'data items'.

Ø The majority of statistical analyses are done on variables.

Type of Variables in Statistics

Statistical variables can be classified based on two criteria: (I) nature of variables and (II) source of variables.

I. Classification of variable based on Nature of Variables

Ø Based on the nature of variables, statistical variables can be classified into TWO major categories: (1) Numerical and (2) Categorical.

Ø The classification chart of variables is given below:

(1). Numerical Variable

Ø Numerical variables are the measurable or countable variables.

Ø They are better called quantitative variables because they give quantitative data.

Ø Example: plant height, fruit weight, crop yield, number of petals, seeds or leaves in a plant, etc.

Ø Numerical variables are further categorized into (a) Discrete variables and (b) Continuous variables.

(a) Discrete variables:

Ø Discrete variables are also called discontinuous variables.

Ø Here, the values which variables can assume are limited to whole numbers only (0, 1, 2, 3 etc.).

Ø There will be ‘gaps’ between the successive values of the variable.

Ø Example: Consider the number of petals in a flower as a discrete variable X. In a real situation, the number of petals in a flower may be 4 or 5 or 6 or any other whole number. There will not be a value such as 5 ½ petals or 4.2 petals. Such variables are called discrete variables or discontinuous variables.

Ø Example: number of brothers, number of petals etc.

(b) Continuous variables

Ø Continuous variables are those that can take any value within a certain range.

Ø There are NO ‘gaps’ between the successive values of the variable.

Ø Example: Consider the height of a plant as the variable X. In a real situation the height of a plant may be 10 cm, 10.1 cm, 10.5 cm, 10.8 cm, 11 cm, etc. Thus, between two whole numbers (here 10 and 11), there are numerous possible values. Such a variable is called a continuous variable.

Ø Examples: height, weight, length, speed etc.

(2). Categorical Variable

Ø Categorical variables are unmeasurable variables.

Ø They are also called non-numerical or qualitative variables since they give qualitative data.

Ø Example: colour of flower, shape of leaves, shape of seeds etc.

Ø Categorical variables are further classified into (a) Nominal variables and (b) Ordinal variables.

(a). Nominal Variables:

Ø Nominal variables have distinct levels that have NO inherent ordering.

Ø Example: Hair colour (white, black, brown etc.), gender (male and female).

Ø In statistics, nominal measurement means assigning a numeral value to a specific characteristic (example: gender of employees in an office: male 20, female 28).

(b). Ordinal Variables:

Ø Ordinal variables have levels that follow a distinct ordering.

Ø Example: The degree of change in fever patients after antibiotic treatment (such as: vast improvement, moderate improvement, no change, death).

II. Classification of variable based on Source of Variables

Ø Based on the source of data (variables), the data can be classified into (a) Primary Data and (b) Secondary Data

(a). Primary Data

Ø The data originally collected in the process of investigation by the investigator is called primary data.

Ø Primary data are more accurate and uniform.

Ø Primary data collection involves the supervision of the investigator.

Ø Primary data collection is time- and labour-consuming.

Ø Biological studies, particularly experimental studies, primarily depend on primary data.

(b). Secondary Data

Ø Secondary data is the data collected by some other person or organization for their own use.

Ø It is data that is already in existence, collected for the same or another purpose than answering the question at hand (Blair M.M.).

Ø Secondary data are usually data published by the primary investigator.

Ø Getting the secondary data is advantageous since it is less expensive and less time consuming.

Ø Secondary data is frequently used in disciplines such as economics, commerce, agriculture, public health etc.

Ø Example: population census data, national mortality rate, annual rain fall, budget records etc.

Ø Research results published in reputed journals can also act as secondary data.

Source of Secondary Data

Ø Published sources are an excellent and frequently used source of secondary data.

Ø These are the records published or maintained by government and non-governmental agencies; the department of census, department of statistics, health department, agriculture and fisheries department, and the official publications of the UN, WHO, UNEP, UNESCO, etc. are good sources of secondary data.

Ø Important sources of secondary data are summarized below:

(a). International publications: These are the regular or occasional reports of international organizations such as UN, WHO, WWF, IMF (International monetary fund) etc.

(b). Official publications of the state and central government: These are the publications by the state or central government on current issues or regular periodic reports. Example: Census of India, Reserve bank bulletin, Report of currency and finance etc.

(c). Committee reports: these are the reports of enquiry commissions appointed by the government. Example: Madhav Gadgil committee report, Kasturirangan committee report etc.

(d). Newspapers and magazines: These are the important review reports and articles published in reputed newspapers and magazines.

(e). Research scholars: These are the reports or results of previous research published in reputed journals.

(f). Semi-official publications: These are the publications by the semi-governmental organizations such as municipalities, provinces etc.

Ø Apart from published data, some genuine but unpublished data can also be used as the source of secondary data with great precaution.

Care to be taken before taking the secondary data

Ø Before taking the secondary data, the investigator must enquire about the following aspects of the data:

Ø The reliability of the data.

Ø The competency of the individual (or organization) who collected the data.


Questions & Answers

Question Context 1

Consider the following function.

1) If we execute the following commands, what will be the output?

The scoping rules of R will cause z <- 4 to take precedence over z <- 10. Hence, g(x) will return a value of 8. Therefore, option A is the correct answer.
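The function listing was lost in this copy; the following is my own reconstruction, consistent with the explanation above (not necessarily the quiz's exact code):

    z <- 10                # global z
    g <- function(x) {
      z <- 4               # local z takes precedence inside g
      x + z
    }
    print(g(4))            # [1] 8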

Question context 2

The iris dataset has different species of flowers such as Setosa, Versicolor and Virginica with their sepal length. Now, we want to understand the distribution of sepal length across all the species of flowers. One way to do this is to visualise this relation through the graph shown below.

2) Which function can be used to produce the graph shown above?

A) xyplot()
B) stripplot()
C) barchart()
D) bwplot()

The plot above is of the strip type, whereas options A, C and D will produce a scatter plot, bar chart and box-and-whisker plot respectively. Therefore, option B is the correct solution.

Question Context 3

Alpha 125.5 0
Beta 235.6 1
Beta 212.03 0
Beta 211.30 0
Alpha 265.46 1

3) Which of the following commands will correctly read the above csv file with 5 rows into a dataframe?

Options A and B will read the first row of the above dataframe as a header. Option C doesn't exist. Therefore, option D is the correct solution.
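The answer options were lost in this copy; for reference, a call along these lines reads a headerless csv correctly (the file name is an assumption):

    df <- read.csv("Dataframe.csv", header = FALSE)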

Question Context 4

Excel file format is one of the most common formats used to store datasets. It is important to know how to import an Excel file into R. Below is an Excel file in which data has been entered in the third sheet.

Alpha 125.5 0
Beta 235.6 1
Beta 212.03 0
Beta 211.30 0
Alpha 265.46 1

File Name – Dataframe.xlsx

4) Which of the following codes will read the above data in the third sheet into a dataframe in R?

All of the above options are true, as they give different methods to read an Excel file into R, and each reads the above file correctly. Therefore, option D is the correct solution.

Question Context 5

A 10 Sam
B 20 Peter
C 30 Harry
D ! ?
E 50 Mark

File Name – Dataframe.csv

5) Missing values in this csv file have been represented by an exclamation mark ("!") and a question mark ("?"). Which of the codes below will read the above csv file correctly into R?

B) csv('Dataframe.csv', header=FALSE, sep=',', na.strings=c('?'))

Option A will not be able to read "?" and "!" as NA in R. Option B will be able to read only "?" as NA but not "!". Option D doesn't exist. Therefore, option C is the correct solution.
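For reference, a call along these lines reads both symbols as NA (my own sketch, using the file name from the question):

    df <- read.csv("Dataframe.csv", header = FALSE,
                   na.strings = c("?", "!"))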

Question Context 6-7

Column 1 Column 2 Column 3
Row 1 15.5 14.12 69.5
Row 2 18.6 56.23 52.4
Row 3 21.4 47.02 63.21
Row 4 36.1 56.63 36.12

File Name – Dataframe.csv

6) The above csv file has row names as well as column names. Which of the following codes will read the above csv file properly into R?

B) csv2('Train.csv', header=TRUE, row.names=TRUE)

Solution: (D)

The row.names argument in options A and B takes only a vector containing the actual row names, or a single number giving the column of the table which contains the row names, not a logical value. Option C doesn't exist. Therefore, option D is the correct solution.
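For reference, one correct way to read such a file (not necessarily the quiz's option D) is to point row.names at the column holding the row labels:

    df <- read.csv("Dataframe.csv", header = TRUE, row.names = 1)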

Question Context 6-7

Column 1 Column 2 Column 3
Row 1 15.5 14.12 69.5
Row 2 18.6 56.23 52.4
Row 3 21.4 47.02 63.21
Row 4 36.1 56.63 36.12

File Name – Dataframe.csv

7) Which of the following codes will read only the first two rows of the csv file?

Option B will not be able to read the csv file correctly, since the default separator in the csv2 function is ";" whereas this csv file uses ",". Option C has the wrong header argument value. Option D doesn't exist. Therefore, Option A is the correct answer.
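For reference, a read.csv call along these lines reads just the first two data rows (my own sketch):

    df <- read.csv("Dataframe.csv", header = TRUE, nrows = 2)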

Question Context 8

8) There are two dataframes stored in Dataframe1 and Dataframe2. Which of the following codes will produce the output shown below?

Feature1 Feature2 Feature3
A 1000 25.5
B 2000 35.5
C 3000 45.5
D 4000 55.5
E 5000 65.5
F 6000 75.5
G 7000 85.5
H 8000 95.5

Solution: (D)

Option C will result in Feature4 being included in the merged dataframe, which is not what we want. Therefore, Option D is the correct solution.

Question Context 9

V1 V2
1 121.5 461
2 516 1351
3 451 6918
4 613 112
5 112.36 230
6 25.23 1456
7 12 457

9) A dataset has been read into R and stored in a variable "dataframe". Which of the below codes will produce a summary (mean, mode, median) of the entire dataset in a single line of code?

Solution: (E)

Option A will give only the mean and the median, but not the mode. Options B, C and D will also fail to provide the required statistics. Therefore, Option E is the correct solution.

Question Context 10

A dataset has been read in R and stored in a variable “dataframe”. Missing values have been read as NA.

A 10 Sam
B NA Peter
C 30 Harry
D 40 NA
E 50 Mark

10) Which of the following codes will not give the number of missing values in each column?

C) sapply(dataframe, function(x) sum(is.na(x)))

Solution: (D)

Option D will give the overall count of the missing values, but not column-wise. Therefore, Option D is the correct solution.
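For reference, idiomatic ways to count missing values (my own sketches, not necessarily the quiz's exact options):

    sapply(dataframe, function(x) sum(is.na(x)))   # per-column NA counts
    colSums(is.na(dataframe))                      # equivalent base-R idiom
    sum(is.na(dataframe))                          # overall count, as in option D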

Question context 11

One of the important phases in a data analytics pipeline is univariate analysis of the features, which includes checking for missing values, examining the distribution, etc. Below is a dataset, and we wish to plot a histogram for the "Value" variable.

Parameter State Value Dependents
Alpha Active 50 2
Beta Active 45 5
Beta Passive 25 0
Alpha Passive 21 0
Alpha Passive 26 1
Beta Active 30 2
Beta Passive 18 0

11) Which of the following commands will help us perform that task?

Solution: (D)

All of the given options will plot a histogram, which can be used to see the skewness of the desired data.

Question Context 12

Parameter State Value Usage
Alpha Active 50 0
Beta Active 45 1
Beta Passive 25 0
Alpha Passive 21 0
Alpha Passive 26 1
Beta Active 30 1
Beta Passive 18 0

Certain algorithms like XGBoost work only with numerical data. In that case, categorical variables present in the dataset are first converted to dummy variables, which represent the presence or absence of a level of a categorical variable in the dataset. For example, after creating the dummy variables for the feature "Parameter", the dataset looks like the one below.

Parameter_Alpha Parameter_Beta State Value Usage
1 0 Active 50 0
0 1 Active 45 1
0 1 Passive 25 0
1 0 Passive 21 0
1 0 Passive 26 1
0 1 Active 30 1
0 1 Passive 18 0

12) Which of the following commands will help us to achieve this?

A) dummies::dummy.data.frame(dataframe, names=c('Parameter'))

Solution: (D)

Option C will encode the Parameter column with 2 levels but will not perform one-hot encoding. Therefore, option D is the correct solution.
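For reference, a base-R way to one-hot encode the Parameter column (my own sketch, not one of the quiz's options):

    one_hot <- model.matrix(~ Parameter - 1, data = dataframe)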

Question context 13

Column1 Column2 Column3 Column4 Column5 Column6
Name1 Alpha 12 24 54 0 Alpha
Name2 Beta 16 32 51 1 Beta
Name3 Alpha 52 104 32 0 Gamma
Name4 Beta 36 72 84 1 Delta
Name5 Beta 45 90 32 0 Phi
Name6 Alpha 12 24 12 0 Zeta
Name7 Beta 32 64 64 1 Sigma
Name8 Alpha 42 84 54 0 Mu
Name9 Alpha 56 112 31 1 Eta

13) We wish to calculate the correlation between “Column2” and “Column3” of a “dataframe”. Which of the below codes will achieve the purpose?

(sum(dataframe$Column2*dataframe$Column3)- (sum(dataframe$Column2)*sum(dataframe$Column3)/nrow(dataframe)))/(sqrt((sum(dataframe$Column2*dataframe$Column2)-(sum(dataframe$Column2)^3)/nrow(dataframe))* (sum(dataframe$Column3*dataframe$Column3)-(sum(dataframe$Column3)^2)/nrow(dataframe))))

In option A, corr is the wrong function name; the actual function to calculate correlation is cor. In option B, the denominator should be the standard deviation, not the variance. Similarly, the formula in Option C is wrong. Therefore, Option D is the correct solution.
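For reference, the idiomatic one-liner uses cor():

    cor(dataframe$Column2, dataframe$Column3)   # Pearson correlation by default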

Question Context 14

Parameter State Value Dependents
Alpha Active 50 2
Beta Active 45 5
Beta Passive 25 0
Alpha Passive 21 0
Alpha Passive 26 1
Beta Active 30 2
Beta Passive 18 0

14) The above dataset has been loaded for you in R in a variable named "dataframe", with the first row representing the column names. Which of the following codes will select only the rows for which Parameter is Alpha?

A) subset(dataframe, Parameter='Alpha')

B) subset(dataframe, Parameter=='Alpha')

In option A, there should be an equality operator instead of the assignment operator. Therefore, option D is the correct solution.

15) Which of the following functions is used to view the dataset in a spreadsheet-like format?

Solution : (B)

Option B is the only option that will show the dataset in the spreadsheet format. Therefore, option B is the correct solution.

Question Context 16

The below dataframe is stored in a variable named data.

A B
1 Right
2 Wrong
3 Wrong
4 Right
5 Right
6 Wrong
7 Wrong
8 Right

16) Suppose B is a categorical variable, and we wish to draw a boxplot for every level of the categorical variable. Which of the below commands will help us achieve that?

The boxplot function in R requires a formula input to draw different boxplots by the levels of a factor variable. Therefore, Option B is the correct solution.
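For reference, the formula interface looks like this (using the data and column names from the question):

    boxplot(A ~ B, data = data)   # one box per level of the factor B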

17) Which of the following commands will split the plotting window into a 4 x 3 grid of windows, where the plots enter the window column-wise?

The mfcol argument will ensure that the plots enter the plotting window column-wise. Therefore, Option B is the correct solution.
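For reference, the call meant here is presumably along these lines:

    par(mfcol = c(4, 3))   # 4 x 3 grid of panels, filled column-wise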

Question Context 18

A Dataframe “df” has the following data:

After reading above data, we want the following output:

18) Which of the following commands will produce the desired output?

Solution: (D)

None of the above options will produce the desired output. Therefore, Option D is the correct solution.

19) Which of the following commands will help us to rename the second column in a dataframe named "table" from alpha to beta?

Solution: (D)

All of the above options are different methods to rename the column names of a dataframe. Therefore, option D is the correct solution.
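For reference, two common renaming idioms (my own sketches, not necessarily the quiz's exact options):

    names(table)[2] <- "beta"                               # by position
    colnames(table)[colnames(table) == "alpha"] <- "beta"   # by name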

Question Context: 20

A majority of work in R uses the system's internal memory, and with large datasets, situations may arise when the R workspace cannot hold all the R objects in memory. So removing unused objects is one solution.

20) Which of the following commands will remove an R object / variable named "santa" from the workspace?

A) remove(santa)
B) rm(santa)
C) Both
D) None

Solution : (C)

Both remove and rm can be used to remove objects from the workspace. Therefore, option C is the correct solution.

21) "dplyr" is one of the most popular packages used in R for manipulating data, and it contains 5 core functions to handle data. Which of the following is not one of the core functions of the dplyr package?

Solution: (D)

summary is a function in the base R package, not in dplyr.

Context – Question 22

During feature selection using the following dataframe (named table), "Column1" and "Column2" proved to be non-significant. Hence, we would not like to take these two features into our predictive model.

Column1 Column2 Column3 Column4 Column5 Column6
Name1 Alpha 12 24 54 0 Alpha
Name2 Beta 16 32 51 1 Beta
Name3 Alpha 52 104 32 0 Gamma
Name4 Beta 36 72 84 1 Delta
Name5 Beta 45 90 32 0 Phi
Name6 Alpha 12 24 12 0 Zeta
Name7 Beta 32 64 64 1 Sigma
Name8 Alpha 42 84 54 0 Mu
Name9 Alpha 56 112 31 1 Eta

22) Which of the following commands will select all the rows from column 3 to column 6 of the dataframe named table shown above?

Options A, B and C are different column-subsetting methods in R. Therefore, option D is the correct solution.
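For reference, one such subsetting idiom:

    table[, 3:6]   # all rows, columns 3 through 6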

Context Question 23-24

Column1 Column2 Column3 Column4 Column5 Column6
Name1 Alpha 12 24 54 0 Alpha
Name2 Beta 16 32 51 1 Beta
Name3 Alpha 52 104 32 0 Gamma
Name4 Beta 36 72 84 1 Delta
Name5 Beta 45 90 32 0 Phi
Name6 Alpha 12 24 12 0 Zeta
Name7 Beta 32 64 64 1 Sigma
Name8 Alpha 42 84 54 0 Mu
Name9 Alpha 56 112 31 1 Eta

23) Which of the following commands will select the rows having "Alpha" values in "Column1" and a value less than 50 in "Column4"? The dataframe is stored in a variable named table.

A) dplyr::filter(table, Column1=='Alpha', Column4<50)

B) dplyr::filter(table, Column1=='Alpha' & Column4<50)

Solution: (C)

The filter function in the dplyr package accepts both "," and "&" to combine conditions. Therefore, Option C is the correct solution.

Question Context 23-24

Column1 Column2 Column3 Column4 Column5 Column6
Name1 Alpha 12 24 54 0 Alpha
Name2 Beta 16 32 51 1 Beta
Name3 Alpha 52 104 32 0 Gamma
Name4 Beta 36 72 84 1 Delta
Name5 Beta 45 90 32 0 Phi
Name6 Alpha 12 24 12 0 Zeta
Name7 Beta 32 64 64 1 Sigma
Name8 Alpha 42 84 54 0 Mu
Name9 Alpha 56 112 31 1 Eta

24) Which of the following code will sort the dataframe based on “Column2” in ascending order and “Column3” in descending order?

Solution: (C)

Both the order and arrange functions can be used to sort a dataframe in R. Therefore, Option C is the correct solution.
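For reference, the two idioms mentioned (my own sketches):

    table[order(table$Column2, -table$Column3), ]   # base R
    dplyr::arrange(table, Column2, desc(Column3))   # dplyr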

25) Dealing with strings is an important part of text analytics, and splitting a string is one of the common tasks performed while creating tokens, etc. What will be the output of the following commands?

Solution : (B)

c(A, B) creates a vector of the two strings A = "alpha beta gamma" and B = "phithetazeta". Upon using strsplit, each string is split at its white spaces, producing a list with one component per string. parts[[1]][2] tells us to print the second sub-element of the first component of the list, which is "beta". Therefore, option B is the correct solution.
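The commands themselves were lost in this copy; a reconstruction consistent with the explanation (my own sketch):

    A <- "alpha beta gamma"
    B <- "phithetazeta"
    parts <- strsplit(c(A, B), " ")   # one list component per input string
    print(parts[[1]][2])              # [1] "beta"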

26) What will be the output of the following command?

A) [FALSE TRUE TRUE FALSE TRUE]

B) [FALSE TRUE TRUE FALSE FALSE]

C) [FALSE FALSE TRUE FALSE FALSE]

Solution: (C)

The above command will go for the exact match of the passed argument and therefore Option C is the correct solution.

Question Context 27

Sometimes, as a data scientist working on textual data, we come across instances where we find multiple unwanted occurrences of a word. Below is one such string.

Solution: (A)

The sub command will replace only the first occurrence in a string, whereas regexec will return a list of positions of the match, or -1 if no match occurs. Therefore, Option A is the correct solution.

28) Imagine a dataframe created through the following code.

Which of the following commands will help us remove the duplicate rows based on both the columns?

All the above methods are different ways of removing the duplicate rows based on both the columns. Therefore, Option D is the correct solution.
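For reference, two such idioms (the dataframe name is an assumption, since the creating code was lost):

    df[!duplicated(df), ]   # keep the first occurrence of each row
    unique(df)              # equivalent shortcut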

Question Context 29

Grouping is an important activity in data analytics, and it helps us discover interesting trends which may not be easily visible in the raw data.

Suppose you have a dataset created by the following lines of code.

29) Which of the following commands will help us to calculate the mean bar value grouped by the foo variable?

All the above methods are used to calculate the grouped statistic of a column. Therefore, Option D is the correct solution.
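For reference, two such idioms (my own sketches, assuming columns foo and bar as in the question):

    aggregate(bar ~ foo, data = df, FUN = mean)                        # base R
    dplyr::summarise(dplyr::group_by(df, foo), mean_bar = mean(bar))   # dplyr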

30) If I have two vectors x <- c(1, 3, 5) and y <- c(3, 2), what is produced by the expression cbind(x, y)?

A) a matrix with 2 columns and 3 rows

B) a matrix with 3 columns and 2 rows

C) a data frame with 2 columns and 3 rows

D) a data frame with 3 columns and 2 rows

Solution: (A)

cbind binds the vectors together as the columns of a matrix, recycling the shorter vector y (with a warning) to match the length of x, so the result is a matrix with 2 columns and 3 rows. Therefore, Option A is the correct solution.
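A quick check of the recycling behavior:

    x <- c(1, 3, 5)
    y <- c(3, 2)
    cbind(x, y)   # 3 x 2 matrix; y is recycled to c(3, 2, 3), with a warning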

31) Which of the following commands will convert the following dataframe named maverick into the one shown at the bottom?

Input Dataframe – “maverick”

Grade Male Female
A 10 15
B 20 15
A 30 35

Output dataframe

Grade Sex Count
A Male 10
A Female 15
B Male 20
B Female 15
A Male 30
A Female 35

A) tidyr::gather(maverick, Sex, Count, -Grade)

B) tidyr::spread(maverick, Sex, Count, -Grade)

C) tidyr::collect(maverick, Sex,Count,-Grade)

Solution: (A)

The spread command converts rows into columns, and there is no collect command in the tidyr or base packages.

Therefore, Option A is the correct solution.

32) Which of the following commands will help us replace every instance of Delhi with Delhi_NCR in the following character vector?

Though the sub command only replaces the first occurrence of a pattern, the strings in this case have just a single appearance of Delhi. Hence, both the gsub and sub commands will work in this situation. Therefore, Option C is the correct solution.

Question Context 33

Sometimes creating a feature which represents whether another variable has missing values or not can prove to be very useful for a predictive model.

Below is a dataframe which has missing values in one of its columns.

Feature1 Feature2
B NA
C 30
D 40
E 50


33) Which of the following commands will create a column named "Missing" with value 1 where the variable "Feature2" has missing values?

Feature1 Feature2 Missing
B NA 1
C 30 0
D 40 0
E 50 0

Option C is the correct answer.
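For reference, one way to build the indicator column (my own sketch, not necessarily the quiz's option C):

    dataframe$Missing <- ifelse(is.na(dataframe$Feature2), 1, 0)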

34) Suppose there are 2 dataframes “A” and “B”. A has 34 rows and B has 46 rows. What will be the number of rows in the resultant dataframe after running the following command?

The all.x argument forces the merge to take place on the basis of A, and hence the result will contain the same number of rows as A. Therefore, Option C is the correct solution.

Question context 35

The very first thing that a data scientist generally does after loading a dataset is to find out the number of rows and columns it has. In technical terms, this is called knowing the dimensions of the dataset. This gives an idea about the scale of the data, which helps in choosing the right techniques and tools.

35) Which of the following commands will not help us to view the dimensions of our dataset?

Solution: (C)

The View command will display the dataset in a spreadsheet-like viewer but will not help us view the dimensions. Therefore, option C is the correct solution.
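For reference, commands that do report the dimensions:

    dim(dataframe)    # rows and columns
    nrow(dataframe)   # rows only
    ncol(dataframe)   # columns only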

Question context 36

Sometimes we face a situation where we have two columns of a dataset and we wish to know which elements of one column are not present in the other. This is easily achieved in R using the setdiff command.

Column1 Column2 Column3 Column4 Column5 Column6
Name1 Alpha 12 24 54 0 Zion
Name2 Beta 16 32 51 1 Beta
Name3 Alpha 52 104 32 0 Gamma
Name4 Beta 36 72 84 1 Delta
Name5 Beta 45 90 32 0 Phi
Name6 Alpha 12 24 12 0 Zeta
Name7 Beta 32 64 64 1 Sigma
Name8 Alpha 42 84 54 0 Mu
Name9 Alpha 56 112 31 1 Eta

36) What will be the output of the following command?

Solution: (B)

The order of arguments matters in the setdiff function. Therefore, option B is the correct solution.

Question Context 37

The below dataset is stored in a variable called “frame”.

A B
alpha 100
beta 120
gamma 80
delta 110

37) Which of the following commands will create a bar plot for the above dataset? Use the values from column B as the heights of the bars.

stat="identity" will ensure that the values in column B become the heights of the bars. Therefore, Option A is the correct solution.
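For reference, the intended call is presumably along these lines (my own sketch, using the frame and column names from the question):

    library(ggplot2)
    ggplot(frame, aes(x = A, y = B)) +
      geom_bar(stat = "identity")   # bar heights taken directly from column B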

Question Context 38

A mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

38) We wish to create a stacked bar chart for the cyl variable, with the stacking criterion being the vs variable. Which of the following commands will help us perform this action?

Both options A and B will create a stacked bar chart guided by the “fill” parameter. Therefore, option C is the correct solution.

39) What is the output of the command paste(1:3, c("x", "y", "z"), sep = "")?
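No solution was preserved for this question in this copy. paste() vectorizes over its arguments, so:

    paste(1:3, c("x", "y", "z"), sep = "")
    # [1] "1x" "2y" "3z"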

Question Context 40

R has a rich library of functions for drawing some very high-end graphs and plots, and you will often want to save the graphs to present your findings to someone else. Saving your plots to a PDF file is one such option.

40) If you want to save a plot to a PDF file, which of the following is a correct way of doing that?

A) Construct the plot on the screen device and then copy it to a PDF file with dev.copy2pdf().

B) Construct the plot on the PNG device with png(), then copy it to a PDF with dev.copy2pdf().

C) Open the PostScript device with postscript(), construct the plot, then close the device with dev.off().

D) Open the screen device with quartz(), construct the plot, and then close the device with dev.off().

The plot is first created on the screen device and can then be copied easily to a PDF file. Therefore, option A is the correct solution.
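For reference, the workflow described in option A (the plotted data is my own placeholder):

    plot(rnorm(100))                    # construct the plot on the screen device
    dev.copy2pdf(file = "myplot.pdf")   # then copy it to a PDF file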

End Notes

If you are learning R, you should use the test above to check your skills in R. If you have any questions or doubts, feel free to post them below.

Learn, compete, hack and get hired!


Variable Types in Data Science and Statistical Analysis

An optimized solution to a real-world problem modeled as a data science use case depends on a multitude of factors. The most important of those are exploratory data analysis, feature engineering and algorithm selection. All of these depend heavily on an understanding of the data as a whole, the independent variables (features) and the dependent variable (outcome variable).

From a statistical perspective, analyzing the datasets corresponding to a typical data science problem will show that the values of these variables fall broadly under 2 categories — categorical or numeric. Categorical variables can be binary, nominal or ordinal, whereas numeric variables can be discrete or continuous.

CATEGORICAL VARIABLES

Dichotomous (or Binary) Variables – Values corresponding to such variables fall under only 2 categories. Example: If a particular variable documents the responses to the question 'Have you ever been to Rome?' with two answer options, "Yes" or "No", then it can be called a binary variable.

Nominal Variables — Values fall under two or more categories, but with no specific order. Example: If a variable documents the responses to a question ‘Name the Country in which you reside’, there could be many distinct answers to that question and the answers will not have any order assigned to them. This can be an example of a nominal variable.

Ordinal Variables — Values corresponding to ordinal variables fall under 2 or more categories like nominal variables, but the categories will follow a certain intrinsic order. Example: If a variable corresponds to a person’s highest education level and can take values High School, Associate Degree, Bachelor’s, Master’s, Ph.D etc., then that can be considered as an ordinal variable following a specific order from lowest education level(High School) to the highest education level (Ph.D).

NUMERIC VARIABLES

Discrete Variables — Discrete numeric variables typically follow a discrete statistical distribution and can take only specific numeric values. Example: If a variable corresponds to the different possible outcomes from rolling a die, there could be only 6 possible values — from 1 to 6. This is an example of a discrete numeric variable.

Continuous Variables — Continuous numeric variables follow a continuous distribution and can take any real numerical value in a finite or infinite range of values. Example: If a variable documents the body temperature of a person, the possible values could be 99.2 °F, 97.9 °F, 102.4 °F, etc.; this can be an example of a continuous numeric variable.

Which of the above variable types are most commonly seen in data sets used for machine learning or data science? Categorical, numeric, or a combination of both; answers might vary based on the individual data scientist's experience. While the data set that the data scientist starts with might have all these different types of variables, it is important to do effective feature selection to pick what is important for the use case at hand, and to do feature engineering to convert one form to another whenever necessary, to make sure that the machine learning model achieves optimum performance.


7.7 ggplot2 calls

As we move on from these introductory chapters, we'll transition to a more concise expression of ggplot2 code. So far we've been very explicit, which is helpful when you are learning.

Typically, the first one or two arguments to a function are so important that you should know them by heart. The first two arguments to ggplot() are data and mapping, and the first two arguments to aes() are x and y. In the remainder of the book, we won't supply those names. That saves typing and, by reducing the amount of boilerplate, makes it easier to see what's different between plots. That's a really important programming concern that we'll come back to when discussing functions.

Rewriting the previous plot more concisely yields:
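The code did not survive in this copy; a sketch in the spirit of the original, using the built-in faithful dataset (my choice of example):

    library(ggplot2)

    # explicit form
    ggplot(data = faithful, mapping = aes(x = eruptions)) +
      geom_freqpoly(binwidth = 0.25)

    # concise form
    ggplot(faithful, aes(eruptions)) +
      geom_freqpoly(binwidth = 0.25)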

Sometimes we'll turn the end of a pipeline of data transformation into a plot. Watch for the transition from %>% to +. I wish this transition wasn't necessary, but unfortunately ggplot2 was created before the pipe was discovered.
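A sketch of such a pipeline-into-plot transition (using the diamonds dataset; the example is my choice):

    library(dplyr)
    library(ggplot2)

    diamonds %>%
      count(cut, clarity) %>%
      ggplot(aes(clarity, cut, fill = n)) +   # note the switch from %>% to +
        geom_tile()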


3.7 Visualizing data in 2D: scatterplots

Scatterplots are useful for visualizing treatment–response comparisons (as in Figure 3.3), associations between variables (as in Figure 3.10), or paired data (e.g., a disease biomarker in several patients before and after treatment). We use the two dimensions of our plotting paper, or screen, to represent the two variables. Let’s take a look at differential expression between a wildtype and an FGF4-KO sample.

Figure 3.25: Scatterplot of 45101 expression measurements for two of the samples.

The labels 59 E4.5 (PE) and 92 E4.5 (FGF4-KO) refer to column names (sample names) in the dataframe dfx, which we created above. Since they contain special characters (spaces, parentheses, hyphen) and start with numerals, we need to enclose them with the downward-sloping quotes (backticks) to make them syntactically digestible for R. The plot is shown in Figure 3.25. We get a dense point cloud that we can try to interpret on the outskirts of the cloud, but we really have no idea visually how the data are distributed within the denser regions of the plot.

One easy way to ameliorate the overplotting is to adjust the transparency (alpha value) of the points by modifying the alpha parameter of geom_point (Figure 3.26).
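A sketch of the call meant here, assuming the dfx data frame and the column names from the text (the alpha value is a typical choice, not necessarily the book's):

    library(ggplot2)
    ggplot(dfx, aes(x = `59 E4.5 (PE)`, y = `92 E4.5 (FGF4-KO)`)) +
      geom_point(alpha = 0.1)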

Figure 3.26: As Figure 3.25, but with semi-transparent points to resolve some of the overplotting.

This is already better than Figure 3.25, but in the more dense regions even the semi-transparent points quickly overplot to a featureless black mass, while the more isolated, outlying points are getting faint. An alternative is a contour plot of the 2D density, which has the added benefit of not rendering all of the points on the plot, as in Figure 3.27.

Figure 3.27: As Figure 3.25, but rendered as a contour plot of the 2D density estimate.

However, we see in Figure 3.27 that the point cloud at the bottom right (which contains a relatively small number of points) is no longer represented. We can somewhat overcome this by tweaking the bandwidth and binning parameters of geom_density2d (Figure 3.28, left panel).

Figure 3.28: Left: as Figure 3.27, but with smaller smoothing bandwidth and tighter binning for the contour lines. Right: with color filling.

We can fill in each space between the contour lines with the relative density of points by explicitly calling the function stat_density2d (for which geom_density2d is a wrapper) and using the geometric object polygon, as in the right panel of Figure 3.28.
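A sketch of such a filled-contour call, again assuming dfx from the text (the palette is my choice):

    library(ggplot2)
    library(RColorBrewer)
    ggplot(dfx, aes(x = `59 E4.5 (PE)`, y = `92 E4.5 (FGF4-KO)`)) +
      stat_density_2d(geom = "polygon", aes(fill = after_stat(level))) +
      scale_fill_gradientn(colours = brewer.pal(9, "Blues")) +
      coord_fixed()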

We used the function brewer.pal from the package RColorBrewer to define the color scale, and we added a call to coord_fixed to fix the aspect ratio of the plot, to make sure that the mapping of data range to x- and y-coordinates is the same for the two variables. Both of these issues merit a deeper look, and we'll talk more about plot shapes in Section 3.7.1 and about colors in Section 3.9.

The density based plotting methods in Figure 3.28 are more visually appealing and interpretable than the overplotted point clouds of Figures 3.25 and 3.26, though we have to be careful in using them as we lose much of the information on the outlier points in the sparser regions of the plot. One possibility is using geom_point to add such points back in.

But arguably the best alternative, which avoids the limitations of smoothing, is hexagonal binning (Carr et al. 1987).

Figure 3.29: Hexagonal binning. Left: default parameters. Right: finer bin sizes and customized color scale.

3.7.1 Plot shapes

Choosing the proper shape for your plot is important to make sure the information is conveyed well. By default, the shape parameter, that is, the ratio between the height of the graph and its width, is chosen by ggplot2 based on the available space in the current plotting device. The width and height of the device are specified when it is opened in R, either explicitly by you or through default parameters (see, for example, the manual pages of the pdf and png functions). Moreover, the graph dimensions also depend on the presence or absence of additional decorations, like the color scale bars in Figure 3.29.

There are two simple rules that you can apply for scatterplots:

If the variables on the two axes are measured in the same units, then make sure that the same mapping of data space to physical space is used, i.e., use coord_fixed. In the scatterplots above, both axes are the logarithm to base 2 of expression level measurements; that is, a change by one unit has the same meaning on both axes (a doubling of the expression level). Another case is principal component analysis (PCA), where the x-axis typically represents component 1, and the y-axis component 2. Since the axes arise from an orthonormal rotation of input data space, we want to make sure their scales match. Since the variance of the data is (by definition) smaller along the second component than along the first (or at most, equal), well-made PCA plots usually have a width that's larger than their height.

If the variables on the two axes are measured in different units, then we can still relate them to each other by comparing their dimensions. The default in many plotting routines in R, including ggplot2, is to look at the range of the data and map it to the available plotting region. However, in particular when the data more or less follow a line, looking at the typical slope of the line can be useful. This is called banking (William S. Cleveland, McGill, and McGill 1988).

To illustrate banking, let’s use the classic sunspot data from Cleveland’s paper.

Figure 3.30: The sunspot data. In the upper panel, the plot shape is roughly quadratic, a frequent default choice. In the lower panel, a technique called banking was used to choose the plot shape. (Note: the placement of the tick labels is not great in this plot and would benefit from customization.)

The resulting plot is shown in the upper panel of Figure 3.30. We can clearly see long-term fluctuations in the amplitude of sunspot activity cycles, with particularly low maximum activities in the early 1700s, early 1800s, and around the turn of the 20th century. But now let's try out banking.

How does the algorithm work? It aims to make the slopes in the curve be around one. In particular, bank_slopes computes the median absolute slope, and then with the call to coord_fixed we set the aspect ratio of the plot such that this quantity becomes 1. The result is shown in the lower panel of Figure 3.30. Quite counter-intuitively, even though the plot takes much less space, we see more in it! In particular, we can see the saw-tooth shape of the sunspot cycles, with sharp rises and slower declines.


10.3 Printing

Data frames have a refined print method that shows only the first and last 5 rows, and all the columns that fit on screen. This makes it much easier to work with large data.

Data frames are designed so that you don’t accidentally overwhelm your console when you print large data frames. But sometimes you need more output than the default display. There are a few options that can help.

First, you can print the first rows of the data frame using .head() and control the number of rows (n) displayed. In the interactive Python viewer in VS Code you can scroll to see the other columns.

You can also control the default print behaviour by setting options:

pd.set_option("display.max_rows", 101): if a data frame has more than 101 rows, the printed output is truncated rather than shown in full.

pd.set_option("display.precision", 5) will set the number of decimals that are shown.

You can see a complete list of options by looking at the pandas help.
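For example (this subsection is Python/pandas, so the sketch is too; option names as in current pandas):

    import pandas as pd

    pd.set_option("display.max_rows", 101)   # truncate printing beyond 101 rows
    pd.set_option("display.precision", 5)    # decimals shown when printing
    pd.describe_option("display.max_rows")   # documentation for a single option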

10.3.1 Subsetting

So far all the tools you've learned have worked with complete data frames. If you want to pull out a single variable, you need some new tools, chief among them the indexing operator [. [ can extract by name or position.
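A minimal sketch (a toy frame of my own):

    import pandas as pd

    df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
    df["y"]        # extract a single column by name
    df.iloc[:, 1]  # extract the same column by position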

