A large data set from Washington (140,000 flower samples) is scrutinized here for evidence of consistently propagated strains with a THC-to-CBD ratio higher than 2 to 1.
Strikingly similar results have been reported from a wide range of studies on the ratio of tetrahydrocannabinol (THC) to cannabidiol (CBD) concentrations in strains of cannabis. Whether the source has been legalized markets in the west, medical markets in the U.S. and Canada, or collections from law enforcement and researchers, three easily distinguishable types of plant have consistently been found: THC-dominant strains (with less than 1% CBD); CBD-dominant strains (less than 1% THC); and balanced strains with comparable concentrations of both substances. Another consistent finding of these studies, carried out in a variety of laboratory settings, is a positive correlation between THC and CBD levels in those plants that can make substantial quantities (>1%) of each. The correlation between THC and CBD quantities in these varied populations suggests that there is a fundamental property of the plant that makes some combinations impossible, for instance, >15% THC and also >5% CBD. Such results have never shown up in published data sets of carefully, consistently tested samples, but those were all relatively small collections. A much larger data set has been released by the state of Washington (140,000 flower samples), and this has been scrutinized for evidence of consistently propagated strains with higher than a 2-to-1 ratio.
As cannabis emerges from the shadows of the prohibition era into the mainstream wellness market, two corresponding forces are changing the way the product is discussed and evaluated: health-conscious consumers are looking for specific benefits from the product, not simply a “high”; and the simple classifications of products that have traditionally been provided to consumers are being replaced by sophisticated biochemical and genetic evaluations.
Cannabis products today are generally sold under colorful strain names, often with combination names that refer to the predecessor strains from which the new strain was bred. In addition, cannabis products are frequently categorized as being of indica or sativa origin, or more likely as being a blend with a certain percentage of indica and sativa.
For consumers who are looking for specific health effects from cannabis, it is important that they are confident that repurchasing a product with the same, or related, strain name will give them the same positive experience they are seeking (and, conversely, that by avoiding other strain names they will not repeat a negative experience). As more and more sophisticated data is collected on the contents of cannabis products and careful investigation is made of the reliability of names and classifications, it is becoming increasingly clear that our current naming systems for cannabis products are inadequate to reliably guide purchasers who are seeking reproducible experiences.
The terms indica and sativa date back more than 200 years. Originally they described the differences in plant shape and leaf pattern, but more recently they have been used as shorthand terms for cannabis varieties that were considered more sedating or more stimulating. After generations of cross-breeding, and with abundant opportunity for misidentification and mislabeling, the validity of these terms as a means of distinguishing expected effects has been called into question. Over the past few years, a number of rigorous studies of the biochemical properties of plants have reached the conclusion that the traditional indica and sativa terminology is not a reliable indicator of the property that is of greatest interest to consumers: the cannabinoid profile.
Molecular genetics techniques, not unlike those used to trace human ancestry, are now being used to investigate plants at the level of their DNA to determine how closely related they are. One study published this year in the Journal of Cannabis Research (1) from the University of Northern Colorado examined the relatedness of 120 samples of cannabis from three legal Western markets, covering 30 named strains. Each strain studied was categorized according to its rating on the indica-sativa continuum as recorded by Leafly.com. Analysis of the strains’ genetic properties revealed poor alignment to the traditional labeling, leading the authors to conclude that “…there is no consistent genetic differentiation between the widely-held perceptions of Sativa and Indica Cannabis types” (1). Their conclusion corresponds with earlier work, notably a study from Sawler and colleagues (2), as well as another study from the University of British Columbia (3) and a study of the strains available in the Nevada medical market (4) that found much greater homogeneity in the genetic and cannabinoid profile than would be suggested by the profusion of strain names.
The accumulating evidence of the unreliability of the indica and sativa terminology has been a factor in the decision by Leafly to no longer categorize cannabis strains by those terms (5). Instead, that organization has recognized that the cannabinoid profile of cannabis plants is better described by one of three terms: tetrahydrocannabinol (THC)-dominant, cannabidiol (CBD)-dominant, or balanced. The evidence that supports this fundamental distinction among all cannabis plants is detailed in the next section.
A range of studies during the past 15 years has established that the relative quantities of THC and CBD in cannabis plants occur in only one of three patterns. An assortment of investigations has produced this insight: studies of genetic crosses; studies of the DNA sequences of cannabinoid synthesizing enzymes; and numerous studies of the cannabinoid profile of strain collections from medical and recreational markets. A wide range of such potency studies, using very different collections of strains, has produced strikingly similar results.
The great similarity in the results of studies of cannabinoid ratios in collections of strains from a wide range of sources points to a fundamental property of cannabis plants that is found no matter where they are cultivated. Across these many strain sources and testing laboratories, the consistent finding is that the relative quantities of THC and CBD can only be in one of three patterns: either there are equivalent concentrations of both substances; or one compound is dominant, with only trace levels of the other. Evidence supporting this three-part classification is presented here from five studies. Each of these studies used a scatter plot to display the relative amounts of THC and CBD in individual strains, and in each of these charts the data points separate neatly into three clusters.
The earliest study in this group was a 2003 cross-breeding experiment by de Meijer and colleagues that examined the offspring of high THC plants crossed with high CBD plants, and their progeny (6). The scatter plots (Figure 1) of the THC (y-axis) and CBD content (x-axis) show three distinct clusters. High THC plants (with little CBD) are clustered on the y-axis, the high CBD (and low THC) plants are clustered on the x-axis, and the cluster that slopes up the middle of the chart has comparable quantities of each. The 1:2:1 ratio of the three types in the progeny of the crosses was interpreted by the authors as evidence for one genetic locus, with two codominant alleles, that controls the relative quantities of THC and CBD.
A study from Indiana University (7) examined 157 strains of cannabis that were collected from breeders, researchers, gene banks, and law enforcement. The cannabinoid profile in that disparate collection of strains conforms very closely to the three-cluster result of the cross-breeding study (Figure 2). Similarly, a 2015 study of 210 strains from the Canadian medical program’s Licensed Producers demonstrated the same clustering of three types in scatter plots (Figure 3) of THC and CBD concentrations (with the x- and y-axes reversed from the earlier figures). A comparable study of 245 strains tested in New Jersey’s medical marijuana program (8) revealed the same pattern of three clusters: THC-dominant, balanced, and CBD-dominant strains, though the vast majority of the strains tested in the New Jersey medical program were the THC-dominant type (Figure 4).
The largest collection of strain testing results came from the state of Washington’s recreational program (9). In this study, published by Jikomes and Zoorob, more than 160,000 test results from 23 independent laboratories were reviewed. This effort produced a wealth of data on consistency and variation in strain identity, as well as on laboratory-to-laboratory consistency. A scatter plot of THC versus CBD concentrations resulted in the three-cluster arrangement that mirrors the results of other studies (Figure 5). These authors used the relative concentrations to classify each strain as being of Chemovar Type I (high THC), Type II (balanced), or Type III (high CBD).
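The three-way classification can be written as a simple rule. The sketch below is illustrative only: it uses the 1% cutoffs from the descriptions given at the start of this article, which are not necessarily the criteria applied by Jikomes and Zoorob, and the function name `chemovar_type` is hypothetical.

```python
def chemovar_type(thc_pct, cbd_pct, cutoff=1.0):
    """Assign a chemovar type from total THC and CBD (percent by weight).

    The 1% cutoff is an assumption taken from this article's descriptions.
    """
    has_thc = thc_pct >= cutoff
    has_cbd = cbd_pct >= cutoff
    if has_thc and has_cbd:
        return "Type II (balanced)"
    if has_thc:
        return "Type I (THC-dominant)"
    if has_cbd:
        return "Type III (CBD-dominant)"
    return "unclassified (both below cutoff)"


# Invented example values, one per type.
for thc, cbd in [(20.0, 0.2), (7.0, 9.0), (0.5, 14.0)]:
    print(thc, cbd, chemovar_type(thc, cbd))
```

A rule this simple would label an entry with 16% THC and 2% CBD as balanced even though it lies far from the central cluster, which is why results of that kind deserve a closer look.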
The common pattern of three distinct clusters on these scatter plots provides strong evidence that this is a fundamental property of cannabis plants. The commonality of the result is all the more striking when the range of sources, both in material and methods, is considered. The laboratories that conducted the cited studies were: a single state-run laboratory; a single academic laboratory; commercial laboratories serving a single medical market; and numerous commercial laboratories serving a fast-growing recreational market. The material tested came from highly controlled medical markets, a recreational market that responds to varied market demand, and strain collections that pre-date the current liberalized legal environment. For all of these studies to produce such similar results strongly suggests that there are very few patterns possible in the ratio of cannabinoids.
The clarity of the three distinct clusters in these studies is in part because of the very striking distribution of the data points in the middle of the scatter plots. In each of the scatter plots, the middle cluster, from strains that contain appreciable quantities of both THC and CBD, is tightly bunched and appears centered on an upward-sloping line.
The upward slope of the center cluster might be casually overlooked, or thought of as expected and “normal,” but such a consistent finding, especially across such varied studies, merits some scrutiny. As THC and CBD (more specifically, their acidic forms tetrahydrocannabinolic acid [THCA] and cannabidiolic acid [CBDA]) are synthesized from the same precursor molecule, cannabigerolic acid (CBGA), a naive assumption might be that as the plant makes more THCA it would make less CBDA, and vice versa. If that were so, the points in the middle of the scatter plots would have a downward slope, a negative correlation. But the clear result, and one that appears to be common to all cannabis strains, is that in balanced-type plants that can accumulate significant (>1%) concentrations of both THC and CBD, the two substances are present in proportional quantities. Balanced plants with high THC also have high CBD (toward the top right of the scatter plot), and those with lower THC have lower CBD (toward the bottom left of the center cluster).
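The contrast between the naive expectation and the observed pattern can be illustrated with a correlation coefficient. The sketch below uses invented numbers, not data from any of the cited studies: one CBD series assumes a fixed 12% pool shared between the two cannabinoids, and the other assumes CBD rises roughly in proportion to THC.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented THC values (percent) for five hypothetical balanced plants.
thc = [3.0, 5.0, 6.0, 8.0, 10.0]

# Naive "fixed pool" model: CBD is whatever remains of a 12% total.
cbd_fixed_pool = [12.0 - t for t in thc]

# Proportional model: CBD rises along with THC (invented values).
cbd_proportional = [2.8, 5.3, 5.9, 8.4, 9.7]

r_fixed = pearson_r(thc, cbd_fixed_pool)
r_prop = pearson_r(thc, cbd_proportional)
print(round(r_fixed, 3), round(r_prop, 3))
```

The fixed-pool series gives a strongly negative coefficient and the proportional series a strongly positive one; the published scatter plots resemble the second case.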
The clarity of the clustering in the scatter plots also calls attention to the white spaces, and raises the question of whether there are certain combinations of cannabinoids that are impossible to find in a single plant. Note the sectors in these plots that do not contain any data points: above a certain concentration of THC, it appears that the CBD level must be minuscule, so that a strain with 15% THC and 5% CBD is not found on these plots. This is most clear in the plots with fewer data points, and less clear in the plot of thousands of data points from the state of Washington. A principal finding, though, of that study on the state of Washington data, was evidence for reporting inflated THC levels. The question remains, then, whether certain combinations of cannabinoids simply have not been reported, or are actually not possible. A closer examination of the Washington data may reveal whether it contains good evidence for some combinations of cannabinoid concentrations that have not been reported elsewhere, in smaller collections, or whether they represent “noise” in a very large and poorly controlled testing environment.
Are Strains with >15% THC and >5% CBD Possible?
To search for cannabis strains that possess a rare combination of THC and CBD, one that has not been reported in small collections, a thorough re-examination has been made of the full database of test results made available by the Washington State Liquor and Cannabis Board. This collection was produced for the study by Jikomes and Zoorob (9) and they have made it available for others to query. The in-between data points, those that are not easily assigned to the three identified clusters, were investigated to look for reliable evidence for certain cannabinoid profiles that have not been documented in small collections. It may be that if enough plants are tested, any combination of THC and CBD concentrations might be found. An alternative view is that such results are simply “noise” in a very large data set, one that has been generated by many different laboratories and protocols.
Searching for results that do not fit the mold, which may either be exceptions to the rule or noise, must be done while bearing in mind important caveats. One of the conclusions of the Jikomes and Zoorob study of this collection is that there was evidence of “cannabinoid inflation” by some laboratories. Their finding of “systematic differences in the cannabinoid content reported by different laboratories” has to be considered as one possible source of unexpected results. Identifying anomalies and getting a handle on the scope of unreliable results, though, should not ruin the credibility and usefulness of the data set as a whole.
The goal of the investigation of the state of Washington data collection was to get an indication of whether the test results that lie between the typical clusters are genuine exceptions to the three-cluster pattern or are better regarded as testing noise.
The Washington test results were generated between June 2014 and May 2017, and the full data set totals 215,286 entries (10). Each entry, with a unique identification number, includes strain name, the name of the grower and the testing laboratory, the test date, and reported values for total THC and total CBD concentrations. Each entry is also identified by the type of product tested (flower, extract, wax, edible); thus the data set can be culled to just flower product, and the number of entries is reduced to 146,768. That quantity of entries still presents a challenge for common data analysis methods; in preparing their scatter plots of cannabinoid concentrations, Jikomes and Zoorob subsampled the data to allow visualization.
To further reduce the data set, and to focus on the strains that fall between the clusters on plots of THC and CBD, the set of all flower results was cut down to only those entries that have a CBD value greater than 1%. With that restriction, the data set was reduced to 6818 entries, and the scatter plot shown in Figure 6 can be generated with standard plotting tools. The three expected clusters are evident, as they are in scatter plots of smaller collections, but many more entries lie between the most concentrated clusters, leaving some uncertainty about which cluster they belong to.
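The two reduction steps (flower only, then CBD above 1%) amount to a pair of filters. The following is a minimal sketch on invented records; the field names `product`, `thc`, and `cbd` are assumptions and are not the column names of the Washington database.

```python
# Hypothetical records; the field names are assumptions, not the
# column names of the Washington database.
entries = [
    {"id": 1, "product": "flower", "thc": 21.3, "cbd": 0.1},
    {"id": 2, "product": "flower", "thc": 6.8, "cbd": 9.4},
    {"id": 3, "product": "extract", "thc": 68.0, "cbd": 2.2},
    {"id": 4, "product": "flower", "thc": 0.6, "cbd": 13.7},
    {"id": 5, "product": "flower", "thc": 17.2, "cbd": 1.0},
]


def flower_with_cbd(records, min_cbd=1.0):
    """Keep flower entries whose CBD value is strictly above min_cbd."""
    return [
        r for r in records
        if r["product"] == "flower" and r["cbd"] > min_cbd
    ]


subset = flower_with_cbd(entries)
print([r["id"] for r in subset])  # [2, 4]
```

Entry 5 is dropped because its CBD value equals the cutoff rather than exceeding it; whether the boundary should be inclusive is a choice the text does not specify.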
The conspicuousness of the points between clusters can be misleading in a plot with so many points. Just as a single star is more noticeable in a dark part of a sky than if it were in the midst of a dense galaxy, so the isolation of the points between clusters can appear more prominent when there are so many data points and the clusters are especially dense. In addition, the deletion of the entries with CBD of 1% or less has substantially thinned the THC-dominant cluster at the bottom of the chart.
To isolate those in-between data points, a new measure was calculated: the percentage of the sum of THC and CBD that is from CBD. A plot of the percent of cannabinoid from CBD shows a very interesting pattern (Figure 7). This bar chart of 6818 items is so dense that it appears to be a curve, with an interesting shape that provides a new way to identify clusters: beginning at lower left, a long flat area with low CBD percentage represents the THC-dominant strains; after a steep upward slope, a rising plateau represents the balanced strains; and a smaller plateau with the highest CBD percentage represents the CBD-dominant strains.
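The measure described here is straightforward to compute. As a minimal sketch (the function name `cbd_share` and the sample values are invented for illustration), sorting the results in ascending order gives the kind of ordered series that such a bar chart displays:

```python
def cbd_share(thc_pct, cbd_pct):
    """Percentage of the combined THC + CBD that is CBD."""
    total = thc_pct + cbd_pct
    if total == 0:
        raise ValueError("THC and CBD are both zero")
    return 100.0 * cbd_pct / total


# Invented (THC, CBD) pairs: one per cluster plus one in-between entry.
samples = [(18.0, 1.2), (16.0, 4.0), (7.0, 9.0), (0.6, 14.0)]
shares = sorted(cbd_share(thc, cbd) for thc, cbd in samples)
print([round(s, 1) for s in shares])
```

In this toy series the in-between entry (16% THC, 4% CBD) has a CBD share of 20%, which places it between the low values of the THC-dominant entry and the values above 50% of the other two.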
The in-between strains lie on the slope, at values between 11% and 45% for CBD as a percentage of the combined cannabinoids. This subgroup was further reduced by calculating the ratio of THC to CBD, and selecting those entries with ratios between 2 and 8; too many of the entries with a ratio of 1–2 were of very low CBD concentration and are not in the set we are seeking. A scatter plot of just the resulting 465 “slope” entries (Figure 8) shows that this data reduction has captured the very points we wish to investigate: do these represent plants that are exceptions to the defined clusters, or do they simply represent testing noise?
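The two selection windows are related, because a THC-to-CBD ratio r corresponds to a CBD share of 100/(1 + r) percent. The sketch below applies both windows as described; treating the bounds as inclusive is an assumption, and the function names are hypothetical.

```python
def thc_cbd_ratio(thc_pct, cbd_pct):
    """THC-to-CBD ratio; cbd_pct must be positive."""
    return thc_pct / cbd_pct


def share_from_ratio(ratio):
    """CBD as a percentage of THC + CBD for a given THC:CBD ratio."""
    return 100.0 / (1.0 + ratio)


def is_slope_entry(thc_pct, cbd_pct):
    """Apply both windows from the text: share 11-45% and ratio 2-8."""
    ratio = thc_cbd_ratio(thc_pct, cbd_pct)
    share = share_from_ratio(ratio)
    return 11.0 <= share <= 45.0 and 2.0 <= ratio <= 8.0


print(round(share_from_ratio(2), 1), round(share_from_ratio(8), 1))
print(is_slope_entry(16.0, 4.0), is_slope_entry(7.0, 9.0))
```

A ratio of 2 corresponds to a CBD share of about 33.3% and a ratio of 8 to about 11.1%, so once the ratio window is applied it is the narrower of the two conditions.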
With the data set reduced to this manageable size by selecting the “slope” entries, a closer scrutiny of each entry is feasible. For many entries, this examination revealed duplicate samples: unique identification numbers for the same strain name from the same grower tested on the same day by the same laboratory. Duplicate samples with equivalent cannabinoid profile do not add to our understanding of the validity of the slope entries, so duplicates were removed.
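Duplicate removal of this kind can be sketched as keeping the first entry for each combination of strain, grower, laboratory, date, and reported values. The field names below are hypothetical, the records are invented, and exact equality of the THC and CBD values is used as a stand-in for "equivalent" profiles.

```python
def remove_duplicates(records):
    """Keep the first entry for each identical (sample, result) key."""
    seen = set()
    unique = []
    for r in records:
        key = (r["strain"], r["grower"], r["lab"], r["date"],
               r["thc"], r["cbd"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


# Invented entries: ids 101 and 102 differ only in their id number.
records = [
    {"id": 101, "strain": "A", "grower": "G1", "lab": "L1",
     "date": "2016-03-01", "thc": 12.4, "cbd": 3.1},
    {"id": 102, "strain": "A", "grower": "G1", "lab": "L1",
     "date": "2016-03-01", "thc": 12.4, "cbd": 3.1},
    {"id": 103, "strain": "A", "grower": "G1", "lab": "L1",
     "date": "2016-03-01", "thc": 9.8, "cbd": 5.0},
]
print([r["id"] for r in remove_duplicates(records)])  # [101, 103]
```

Entry 103 survives because its reported values differ, which is exactly the kind of same-day disagreement that calls for separate scrutiny.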
Removing samples with identical profiles, though, raises another topic that was a focus area of the Jikomes and Zoorob study, in addition to variation between laboratories, and that is strain consistency. Close examination of the data set reduced by removal of duplicate samples turned up a number of anomalous results. For 19 sets of entries, covering 49 entries in total, different tests of the same strain from the same grower, conducted on the same day by the same laboratory (usually recorded within an hour of each other) had such discrepancies in the reported values as to call them into question. An extreme example of this is shown in Figure 9a: five samples of the same strain tested on the same day show THC values from 3.9% to 11.7% and CBD values from 2.5% to 8.5%. There is no way now to go back and determine which of these is the most accurate number. These 19 sets of entries also raise questions about other results that are outliers, but for which there is no comparator value that would allow us to identify an anomalous result. This sort of data adds to the concern that testing noise is a contributor to the subset of results that fall between clusters.
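One way to surface such sets automatically is to group entries by strain, grower, laboratory, and date, and flag any group whose reported values span too wide a range. This is a sketch rather than the procedure used for this article: the field names are hypothetical, the 2-percentage-point tolerance is arbitrary, and the example pairs the extreme values quoted for Figure 9a into two records, a pairing that is itself an assumption.

```python
from collections import defaultdict


def flag_discrepant_groups(records, max_spread=2.0):
    """Return same-day groups whose THC or CBD range exceeds max_spread.

    max_spread (percentage points) is an arbitrary illustrative tolerance.
    """
    groups = defaultdict(list)
    for r in records:
        key = (r["strain"], r["grower"], r["lab"], r["date"])
        groups[key].append(r)

    flagged = {}
    for key, rows in groups.items():
        if len(rows) < 2:
            continue  # no comparator value, so nothing to compare
        thc_spread = max(r["thc"] for r in rows) - min(r["thc"] for r in rows)
        cbd_spread = max(r["cbd"] for r in rows) - min(r["cbd"] for r in rows)
        if thc_spread > max_spread or cbd_spread > max_spread:
            flagged[key] = (round(thc_spread, 1), round(cbd_spread, 1))
    return flagged


# Two records built from the extremes quoted for Figure 9a (pairing
# assumed), plus one invented entry with no comparator.
tests = [
    {"strain": "X", "grower": "G1", "lab": "L1", "date": "d1",
     "thc": 3.9, "cbd": 2.5},
    {"strain": "X", "grower": "G1", "lab": "L1", "date": "d1",
     "thc": 11.7, "cbd": 8.5},
    {"strain": "Y", "grower": "G2", "lab": "L1", "date": "d1",
     "thc": 16.0, "cbd": 4.0},
]
flagged = flag_discrepant_groups(tests)
print(flagged)
```

The single test of strain "Y" is never flagged, which mirrors the limitation noted in the text: an outlier with no comparator cannot be identified as anomalous this way.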
A different sort of anomalous result is shown in Figure 9b: rather than individual results showing too much discrepancy, these results show more uniformity than should be expected. In this case, 10 different strains from one grower were tested on the same day by one laboratory, and the results are curiously consistent. While the THC values range from 5.7% to 11.8%, the CBD values are in a very tight range, from 1.1–1.2% with only one as high as 1.5%. Such a result is not impossible of course, and might be plausible if each of these strains was proprietary to the grower (leaving aside the business logic of growing 10 strains with the same profile). A check on the strains in question, though, shows several strain names that are very common in this data set (Dutch Treat and Snoop’s Dream). The values in this suspect set are unlike the results of hundreds of other entries that have the same strain name, but are from other growers. This is another set of points between the clusters that we would conclude is not reliable evidence for plants with unique properties.
A different type of anomalous result, the flip side to the curious consistency in the CBD values of entries with different names, is illustrated in Figure 9c: a set of 12 tests of a single strain name, done over 3 months for one grower, with striking consistency in the THC values (10 results between 15.1 and 15.9), but with CBD values that increase from 1.08 to 4.2. The increase in the reported CBD values occurs sequentially over the 3 months of testing with only minor deviation. For one strain from one grower to shift, steadily and smoothly, from a THC–CBD ratio of 14.7 to 3.7 in 3 months, without a change in the THC value, is a remarkable result.
Putting aside those results where there is a clear question about reliability, we are left with a set of 354 “slope” entries that fall between the clusters, and now we can investigate if there are consistent results in them or whether they are better regarded as noise. The next path to look for a reliable finding was to investigate strain names that are over-represented in this set. Several of the most common strain names in the full data set are also the most common in this “slope” subset, for example Harlequin and Cannatonic. As pointed out by Jikomes and Zoorob in their study, there is considerable evidence that common strain names have been applied to plants with very dissimilar characteristics. This is evident in a scatter plot of cannabinoid concentrations for all of the 287 entries (in the set of 6818) with the strain name “Harlequin” (Figure 9d). The majority of entries are in the blended cluster at the center of the plot, but there is wide divergence, making this plot appear as a microcosm of the population as a whole (compare Figure 6).
Looking for consistent representation of certain strains or growers in the “slope” subset produces something of a mixed bag of results. Considerable dispersion exists in both strain names and growers: the final set of 354 entries contains 271 different strain names and comes from 179 separate growers. Of the different strain names, only eight represent more than 1% of the total population, and several of these names are just as common in the population as a whole as in this select subset.
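Counting how often each strain name appears, and keeping only the names above a share threshold, can be sketched as follows. The list of names is invented and the function name is hypothetical.

```python
from collections import Counter


def names_above_share(strain_names, min_share=0.01):
    """Return {name: share} for names exceeding min_share of all entries."""
    counts = Counter(strain_names)
    total = len(strain_names)
    return {
        name: count / total
        for name, count in counts.items()
        if count / total > min_share
    }


# Invented list of 200 names: one name repeated, the rest unique.
names = ["Strain A"] * 5 + [f"Strain {i}" for i in range(195)]
print(names_above_share(names))
```

For a set of 354 entries, exceeding 1% means a name must appear at least four times, so a name can clear that bar with very few entries.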
The 354 entries being investigated for reliable evidence of plants that fall between the clusters are just 0.24% of the total number of flower test results in the full database. Of these, only 43 are reported as having >15% THC and >5% CBD. In this most select subset, there are a few entries with repeated strain names, but each of these has its own data issues. In one case, there are four entries with the same strain name from the same grower, but these four samples, tested over 2 weeks in 2015, were entirely unlike the other 110 samples from that grower tested over the subsequent 3 years, all of which had conventional blended (1:1) THC and CBD values. It is possible that this strain produced plants high in both THC and CBD in one instance, but that strain name does not provide a reliable guide to finding it again.
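The proportions quoted here can be checked with a few lines of arithmetic, using only the counts given in the text (354 slope entries, 146,768 flower results, and 43 entries above both thresholds). The variable names are illustrative.

```python
slope_entries = 354        # entries remaining between the clusters
flower_entries = 146_768   # flower results in the full database
high_both = 43             # entries with >15% THC and >5% CBD

pct_of_flower = 100 * slope_entries / flower_entries
pct_high_both = 100 * high_both / flower_entries

print(f"{pct_of_flower:.2f}% of flower results are slope entries")
print(f"{pct_high_both:.3f}% of flower results exceed both thresholds")
```

The first figure reproduces the 0.24% stated above; the second shows that entries reported above both thresholds amount to roughly 3 in every 10,000 flower results.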
Similar problems are encountered when looking at results from individual growers who had more than one entry with >15% THC and >5% CBD. One strain name from a grower with two entries in this subset had a strain with a THC value of 16.8% and CBD of 5.5%, but all of the other data about this strain makes this appear to be a simple clerical error: 23 other tests of this strain name, from three growers, were reported as CBD-dominant. This is particularly made clear when we see that on the same day of the results in question, another laboratory tested the same strain from the same grower and reported reverse values: 5.5% THC and 14.4% CBD. Another grower had two entries in this set, but one strain had samples tested by different laboratories two weeks apart, one with 19.1% THC and 1.5% CBD and the other with 24.2% THC and 5.9% CBD. The second test was done by the laboratory flagged by Jikomes and Zoorob for reporting THC results “significantly higher (p < 0.001; Wald test) than all other labs,” suggesting this could be a case of a spurious laboratory result.
A search for reliable strain names or growers in the set of entries with >15% THC and >5% CBD did not prove fruitful, casting doubt on the idea that such strains can be consistently produced, if they exist at all. There are some promising hints, however, that there are reliable plant characteristics in the range of >15% THC and >2% CBD. There are a few strain names that are over-represented in this subset, though they are not exclusive to it: Lavender and Green Crack have enough entries in this subset, without obvious data problems, that they are promising leads to continue a search for strains that fall between the clusters.
The lack of reliable evidence for strains with cannabinoid concentrations that fall outside of the three clusters (THC-dominant, CBD-dominant, and balanced) does not necessarily mean that such strains cannot exist; after all, absence of evidence is not the same as evidence of absence. It may be the case that growers have concluded that there is not a market for strains with 15% THC and 5% CBD, and so have not attempted to initiate or propagate any plants with that combination. However, with all of the incentive for innovation in competitive markets and the interest in new experiences, it seems that someone would try to address this niche. The evidence from the very large Washington data set is that if such a strain is possible, it is a spontaneous occurrence which has not been reliably propagated.
Thomas A. Coogan, PhD, is an Academic and Research Liaison with the New Jersey Cannabis Industry Association. Direct correspondence to: firstname.lastname@example.org
T.A. Coogan, Cannabis Science and Technology 3(2), 32–39 (2020).