Part 2: Phenotyping Assays

In Marker Assisted Breeding (MAB), phenotypic data are usually used together with genotypic data to find statistical correlations between the phenotype and specific molecular markers. These correlations help identify regions of the genome (which may be, but are not necessarily, specific genes) that are associated with a particular trait. The statistical significance (i.e. how strong the correlation is) and the reproducibility of the marker-trait correlations identified are only as good as the assay used for phenotyping. Most traits can be measured in several ways. Some examples are:

  • Diseases - measure by eye (e.g. % diseased) or ELISA, etc.
  • Fruit colour - spectrophotometer, visualization
  • Yield - number of fruit, weight of whole harvest, etc.
  • Fruit or grain size - direct measurement, number per unit weight

Part 3: Data Types in Phenotypes

When collecting phenotypic data, there are two types you may use: 1) categorical and 2) continuous. It is important to understand the difference, because the appropriate statistical analysis depends on the type of data. Some software programs used to analyze phenotypic data will also ask you to identify which type you have. Both types are described below.

Categorical data are, as you might guess, data that fall into discrete categories, or classes. There are two main types of categorical data: nominal and ordinal.

Nominal data have no natural order or relationship. In the fruit shape example below (Illustration 3a), a score of 2 does not imply more than or better than a score of 1. The classes are simply different, with no particular relationship to each other.

Figure 2a/Illustration 3a: Data with no natural relationship

To find a marker correlated with this type of data, you need only a test of independence (e.g. chi-square). For example, in Illustration 3a, where we have 3 different fruit shape types, a test of independence will tell you whether the allele combination at each marker is correlated with round, blocky or long fruit, or is independent of fruit shape. For Marker Assisted Breeding, this information is important in identifying which regions of the genome (as identified by the molecular markers you are using) might be associated with fruit shape.
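The test of independence described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the genotype labels (AA, AB, BB), the plant counts and the function name `chi_square_independence` are all hypothetical, invented for the example. It computes the chi-square statistic from a genotype-by-shape contingency table and compares it with the standard 5% critical value for 4 degrees of freedom (9.488).

```python
from collections import Counter


def chi_square_independence(genotypes, shapes):
    """Return (chi-square statistic, degrees of freedom) for two
    equal-length lists of categorical observations."""
    n = len(genotypes)
    row_totals = Counter(genotypes)          # plants per marker genotype
    col_totals = Counter(shapes)             # plants per fruit shape class
    observed = Counter(zip(genotypes, shapes))

    chi2 = 0.0
    for genotype, row_n in row_totals.items():
        for shape, col_n in col_totals.items():
            # Count expected in this cell if genotype and shape were independent
            expected = row_n * col_n / n
            chi2 += (observed[(genotype, shape)] - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof


# Hypothetical data: 20 plants per genotype at one marker (invented counts)
genotypes = ["AA"] * 20 + ["AB"] * 20 + ["BB"] * 20
shapes = (
    ["round"] * 16 + ["blocky"] * 2 + ["long"] * 2      # AA plants
    + ["round"] * 4 + ["blocky"] * 14 + ["long"] * 2    # AB plants
    + ["round"] * 2 + ["blocky"] * 4 + ["long"] * 14    # BB plants
)

chi2, dof = chi_square_independence(genotypes, shapes)

# Standard chi-square critical value for 4 degrees of freedom at the 5% level
CRITICAL_5PCT_DF4 = 9.488
marker_associated = chi2 > CRITICAL_5PCT_DF4
print(f"chi2 = {chi2:.2f}, dof = {dof}, associated with shape: {marker_associated}")
```

In practice a statistics package would normally be used instead of hand-written code; for example, `scipy.stats.chi2_contingency` performs the same test and also returns a p-value.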

The second type of categorical data is ordinal. In contrast to nominal data, ordinal data have a natural order (the scores are related to one another). For example, if you score fruit size on a scale of 1-5, where 1 = smallest and 5 = biggest, a score of 2 means the fruit are larger than fruit scored 1 (Illustration 3b). There is a relationship between the scores, a natural order.

Figure 3/Illustration 3b: Related Data

For statistical analysis of this kind of data you need an association test (e.g. Kendall’s tau statistic; Kendall 1938).
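As a rough sketch of such an association test, the code below computes Kendall's tau between a marker and an ordinal size score. Several things here are assumptions made only for illustration: the marker is coded as the number of copies of one allele (0, 1 or 2), the scores are invented, and the tau-b variant is used because both variables contain many tied values. The function name `kendall_tau_b` is hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation, which corrects for ties."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue                  # tied on both variables: ignored
        if dx == 0:
            only_x_tied += 1          # same genotype, different score
        elif dy == 0:
            only_y_tied += 1          # same score, different genotype
        elif dx * dy > 0:
            concordant += 1           # pair ordered the same way on both
        else:
            discordant += 1           # pair ordered in opposite ways
    untied_pairs = concordant + discordant
    denom = sqrt((untied_pairs + only_x_tied) * (untied_pairs + only_y_tied))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical data: allele dosage at one marker and fruit size scored 1-5
allele_dosage = [0, 0, 0, 1, 1, 1, 2, 2, 2]
size_score = [1, 2, 1, 3, 2, 3, 4, 5, 5]

tau = kendall_tau_b(allele_dosage, size_score)
print(f"Kendall's tau-b = {tau:.3f}")
```

A value near +1 or -1 indicates that size scores rise or fall consistently with the marker genotype, while a value near 0 indicates no ordered relationship. Library routines such as `scipy.stats.kendalltau` compute the same statistic together with a p-value.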

In contrast with categorical data, continuous data do not fall into discrete categories. Instead they produce a continuous distribution (Illustration 4). Some of these distributions can be modeled mathematically. The two most common models are the Poisson (strictly speaking, a model for count data) and the Gaussian (or normal).

Examples of continuous traits include yield, size and nutrient content. You could certainly score these traits on a scale, but if you measure the exact quantities (of yield, for example), they will not fall into clear classes.

Illustration 4 compares a perfect normal distribution (Illustration 4a) with a histogram of yield data from a QTL experiment (Illustration 4b). The data points fall into a continuous range, not discrete classes.

Identifying correlations for this type of data requires more complex calculations, such as regression analysis.
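To give a feel for what a regression involves, here is a minimal sketch of a single-marker linear regression of yield on genotype. It is a simplified illustration under stated assumptions: the marker is coded additively as allele dosage (0, 1 or 2), the yield values are invented, and the helper name `simple_linear_regression` is hypothetical. It is not meant to reproduce what any particular QTL software does.

```python
def simple_linear_regression(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx                      # change in yield per allele copy
    intercept = mean_y - slope * mean_x    # predicted yield at dosage 0
    r_squared = sxy ** 2 / (sxx * syy)     # share of yield variance explained
    return slope, intercept, r_squared


# Hypothetical data: allele dosage at one marker and yield per plant (kg)
allele_dosage = [0, 0, 0, 1, 1, 1, 2, 2, 2]
yield_kg = [4.1, 3.8, 4.4, 5.0, 5.3, 4.7, 6.2, 5.9, 6.1]

slope, intercept, r_squared = simple_linear_regression(allele_dosage, yield_kg)
print(f"slope = {slope:.3f} kg per allele, R^2 = {r_squared:.3f}")
```

The slope estimates the average effect of each additional copy of the allele on yield, and R-squared indicates how much of the variation in yield the marker accounts for. A routine such as `scipy.stats.linregress` would additionally report a p-value for the slope.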

Figure 4/Illustration 4a: Normal distribution. Image from http://mathworld.wolfram.com/PoissonDistribution.html

Figure 5/Illustration 4b: Histogram of yield. Image from T. Fulton using QGene software (http://www.qgene.org).