The concepts of orthology and paralogy, accompanied by precise definitions (see below), were introduced by Walter Fitch in a seminal paper published as early as 1970 [1]. Over the next three decades or so, the use and abuse (not infrequently misuse) of these terms gave rise to interesting discussions about the fruitfulness of such concepts [2-4], a debate in which Fitch himself took part [5].
Currently, in the post-genome era, with hundreds of genomes sequenced, there is little doubt of the usefulness, I would even say necessity, of these concepts. Thus, as stated by Eugene V. Koonin, a researcher at the National Center for Biotechnology Information (NCBI), “a clear distinction between orthologs and paralogs is critical for the construction of a robust evolutionary classification of genes and reliable functional annotation of newly sequenced genomes” [6].
Before adding another word, let’s remember the original definitions of homology, orthology and paralogy in the context of gene/protein evolution.
Much of the confusion arises from the misconception that two paralogous genes must be present in the same genome (the same species), when nothing in the definition imposes such a restriction. Such confusion may have been favored by the fact that at the time during evolution when paralogs originate by duplication, both copies will be present, of course, in the same genome (Figure 1), but it does not imply that these paralogs will always be restricted to the same genome (organism) as evolution progresses.
Over time, gene families expand (gene duplication) and contract (gene loss), so we can end up finding paralogs in different species. Sometimes these paralogs are easy to recognize: for instance, multiple homologs in the same genome will always be paralogs. However, in many other cases paralogs are easy to confuse with orthologs (Figure 2).
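The distinction can be made concrete with a small sketch. It assumes a toy gene tree (not taken from the GS data) whose internal nodes are labelled by hand as speciation (S) or duplication (D); the helper `relationship()` is a hypothetical name introduced here for illustration. Two homologs are classified according to the event at their most recent common ancestor.

```r
library(ape)

# Toy gene tree (made up): a duplication (D) at the root produced copies A and B,
# and each copy was later split by a speciation event (S) into species 1 and 2.
gene_tree <- read.tree(text = "((sp1_A,sp2_A)S,(sp1_B,sp2_B)S)D;")

# Hypothetical helper: two homologs are orthologs if their most recent common
# ancestor is a speciation node, and paralogs if it is a duplication node.
relationship <- function(tree, x, y) {
  mrca  <- getMRCA(tree, c(x, y))
  event <- tree$node.label[mrca - Ntip(tree)]
  if (event == "S") "orthologs" else "paralogs"
}

relationship(gene_tree, "sp1_A", "sp2_A")  # same copy, different species
relationship(gene_tree, "sp1_A", "sp1_B")  # different copies, same genome
relationship(gene_tree, "sp1_A", "sp2_B")  # different copies, different genomes
```

The last call is the case that is easy to get wrong: if sp2_A and sp1_B were lost, each species would keep a single gene, yet the two survivors would still be paralogs.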
Before we move forward and jump into the specifics of the GS evolutionary history in gymnosperms, one last general cautionary note. Another misconception often found in the literature relates to the use of the term ‘ortholog’ for proteins present in different species but fulfilling the same function, while in this misleading context the term ‘paralog’ is reserved for proteins found in the same species but performing different functions. However, as pointed out by Jensen [4]: “Although plenty of examples exist for which this evolutionary scenario has indeed played out, it is quite possible for orthologs to acquire different catalytic (or regulatory) properties and for paralogs to retain the same function”. In fact, recent data suggest that this last possibility occurs very frequently [7].
The enzyme glutamine synthetase (GS, EC 6.3.1.2) catalyzes the incorporation of ammonium into glutamate to form glutamine in an ATP-dependent fashion. This enzyme plays a crucial role in plant nitrogen assimilation, being responsible for up to 95 % of the ammonium assimilated in plants.
Conifers present two well-differentiated families of cytosolic GS isoforms: GS1a, mainly expressed in photosynthetic tissues, and GS1b, which is relevant in nonphotosynthetic tissues. These GS1 lineages have been detected in all gymnosperms as well as in basal groups of angiosperms [8]. In general, GS1a is encoded by a single gene (locus). In contrast, it is not uncommon to find different loci contributing to the variability of GS1b, both in gymnosperms and angiosperms, although in a much more pronounced way in the latter.
To illustrate the points developed in the current vignette, we have selected 12 conifer species whose phylogenetic relationships are well established (Figure 3).
Together, these species contribute 29 GS isoforms: 12 GS1a proteins (one per species) and 17 GS1b proteins (Figure 4).
Next, we give a few instructions to plot both the species and protein family trees using the data included in the orthGS package.
# Attach the orthGS package (data and functions used in this vignette):
library(orthGS)
# Plot species tree:
str <- ape::read.tree(text = "((((Pa,Psm),(Pp,Pin)),(Abi,Ap)),((Ara,(Pod,Nag)),(Sci,(Tba,Tax))));")
plot(str)
# Load GS sequence data:
agf <- agf
# Aligning sequences and building an unrooted tree
# (remember you need the MUSCLE software in your path):
conif <- agf[which(agf$short %in% str$tip.label), ]
ptr <- mltree(msa(sequences = conif$prot,
                  ids = conif$phylo_id,
                  inhouse = TRUE)$ali)$tree
# Rooting and plotting the tree (branch lengths ignored, so tips are aligned):
ptr <- madRoot(ptr)
plot(ptr, use.edge.length = FALSE, cex = 0.6)
As is evident from the previous figures, there are some conflicts between the well-established species tree (Figure 3) and the GS protein tree (Figure 4) we have built. For instance, the genus Picea is sister to the genus Pinus; however, GS1a proteins from Picea species are more closely related to GS1a from Abies than to those from Pinus, a discrepancy that obviously demands reconciliation!
When we look at the genomes of these 12 conifer species, we find that GS1a is a single-copy gene. That is, each species has one and only one GS1a gene. We could naively think that GS1a is a good gene to infer the phylogenetic relationships of these species. However, we have seen that this is not the case. The reason is that we have assumed that the 12 GS1a proteins we have been analyzing are orthologs, when in reality, as we shall see below, we have a mixture of orthologs and paralogs.
In phylogenetics, reconciliation is the term used to encompass a wide and varied range of inference techniques that aim to find the historical events (gene duplication, transfer, loss, etc.) that best explain the inconsistencies between gene tree and species tree. Although many approaches have been developed in the last few years (see [9] for a review), the one implemented in the orthGS package is the method described in [10], which is based on the parsimony principle: the best gene/protein family history is the history with the fewest events.
We will focus on the six species belonging to the order Pinales (green rectangle in Figure 3), whose phylogenetic relationships seem to be in conflict with those suggested by the protein tree (Figure 4, top pink rectangle). So, let’s start by collecting the relevant data regarding the GS isoforms present in this set of species:
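To give a feel for how a parsimony-style reconciliation can flag duplications, here is a minimal sketch of the classic LCA-mapping idea. It is not the orthGS implementation and not necessarily the procedure of [10]. The toy trees, the `<species>_<gene>` tip naming and the helpers `species_of()`, `tips_below()` and `lca_map()` are all assumptions made for illustration. Each gene-tree node is mapped to the most recent common ancestor, in the species tree, of the species found below it; a node that maps to the same species-tree node as one of its children is called a duplication, otherwise a speciation. Gene losses are not counted in this sketch.

```r
library(ape)

# Toy trees (made up). Gene-tree tips are named "<species>_<gene>".
species_tree <- read.tree(text = "((A,B),C);")
gene_tree    <- read.tree(text = "((A_g1,B_g1),(B_g2,C_g2));")

species_of <- function(tip) sub("_.*$", "", tip)

# Labels of all tips descending from a node (a tip returns its own label).
tips_below <- function(tree, node) {
  if (node <= Ntip(tree)) return(tree$tip.label[node])
  children <- tree$edge[tree$edge[, 1] == node, 2]
  unlist(lapply(children, function(ch) tips_below(tree, ch)))
}

# LCA mapping: gene-tree node -> species-tree node number.
lca_map <- function(node) {
  sp <- unique(species_of(tips_below(gene_tree, node)))
  if (length(sp) == 1) {
    which(species_tree$tip.label == sp)  # single species: map to its tip
  } else {
    getMRCA(species_tree, sp)
  }
}

# Classify every internal gene-tree node.
internal <- (Ntip(gene_tree) + 1):(Ntip(gene_tree) + Nnode(gene_tree))
events <- sapply(internal, function(node) {
  children   <- gene_tree$edge[gene_tree$edge[, 1] == node, 2]
  child_maps <- sapply(children, lca_map)
  if (lca_map(node) %in% child_maps) "duplication" else "speciation"
})
names(events) <- internal
events
```

In this toy case the root of the gene tree is inferred to be a duplication, because one of its child clades (B_g2, C_g2) already spans the root of the species tree.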
data <- subsetGS(sp = c("Ap", "Abi", "Pin", "Pp", "Psm", "Pa"))[, 2:9]
data$phylo_id
#> [1] "Pp_GS1a" "Pa_GS1a" "Ap_GS1a" "Abi_GS1a" "Psm_GS1a"
#> [6] "Pin_GS1a" "Pp_GS1b_1" "Pp_GS1b_2" "Pa_GS1b_1" "Pa_GS1b_2"
#> [11] "Ap_GS1b" "Abi_GS1b_1" "Psm_GS1b_1" "Psm_GS1b_2" "Pin_GS1b_1"
#> [16] "Pin_GS1b_2"
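As a quick sanity check on this listing, the identifiers can be tallied by species and lineage. The sketch below copies the 16 identifiers printed above into a plain character vector, so it runs without the package. It assumes the identifiers follow a `<species>_<lineage>[_<copy>]` pattern, which is inferred from the output rather than documented here.

```r
# Identifiers copied from the output above.
ids <- c("Pp_GS1a", "Pa_GS1a", "Ap_GS1a", "Abi_GS1a", "Psm_GS1a",
         "Pin_GS1a", "Pp_GS1b_1", "Pp_GS1b_2", "Pa_GS1b_1", "Pa_GS1b_2",
         "Ap_GS1b", "Abi_GS1b_1", "Psm_GS1b_1", "Psm_GS1b_2", "Pin_GS1b_1",
         "Pin_GS1b_2")

# Split each identifier into its species prefix and GS lineage (GS1a or GS1b).
species <- sub("_.*$", "", ids)
lineage <- regmatches(ids, regexpr("GS1[ab]", ids))

# Cross-tabulate isoforms per species and lineage, then total per lineage.
tally <- table(species, lineage)
tally
colSums(tally)
```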
We can check that there are 16 forms of GS: 6 GS1a proteins (one per species) and 10 GS1b isoforms. Next, we are going to plot the orthology network for this set of proteins. For this purpose we will use the function orthG, which takes as its argument the set of species that interest us:
o <- orthG(set = c("Ap", "Abi", "Pin", "Pp", "Psm", "Pa"))
A <- o[[1]] # Adjacency matrix
g <- o[[2]] # igraph object (orthology network)
It should be noted that the function orthG returns a list with two objects. The first one is an adjacency matrix: in our case, a square matrix of order 16 whose entries are 1 for orthology or 0 for paralogy. The second element of the returned list is an igraph network.
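An adjacency matrix with this encoding can be queried directly with base R. The sketch below uses a small made-up matrix, with hypothetical protein names and values, so that it runs on its own. The same two queries should in principle apply to the matrix returned by orthG, assuming its rows and columns are named by protein identifier.

```r
# Hypothetical 4 x 4 adjacency matrix using the encoding described above
# (1 = orthologs, 0 = paralogs); names and values are made up for illustration.
ids <- c("Sp1_GS1a", "Sp2_GS1a", "Sp3_GS1a", "Sp3_GS1b")
A_toy <- matrix(c(0, 1, 0, 0,
                  1, 0, 0, 0,
                  0, 0, 0, 0,
                  0, 0, 0, 0),
                nrow = 4, byrow = TRUE, dimnames = list(ids, ids))

# Number of orthologs of each protein (row sums of the adjacency matrix).
rowSums(A_toy)

# Every pair of distinct proteins inferred to be paralogs
# (upper triangle only, so each pair is listed once and the diagonal is skipped).
idx <- which(A_toy == 0 & upper.tri(A_toy), arr.ind = TRUE)
paralog_pairs <- data.frame(p1 = ids[idx[, "row"]], p2 = ids[idx[, "col"]])
paralog_pairs
```

Restricting the search to the upper triangle avoids any dependence on how the diagonal of the matrix is filled.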
If desired, we can plot a subnetwork indicating the nodes of interest. For instance, let’s focus our interest on GS1a:
gs1a <- data$phylo_id[grepl("GS1a", data$phylo_id)] # selected nodes (GS1a only)
plot(igraph::subgraph(g, vids = gs1a))
We observe that the GS1a isoforms from the genus Pinus are paralogs of the GS1a isoforms from the genera Picea and Abies. On the other hand, all the GS1a isoforms from Picea and Abies are orthologs of one another. In this way, the phylogenetic relationships inferred for the GS1a proteins from Pinales (Figure 4) can be reconciled with the phylogenetic tree of the species (Figure 3). Furthermore, the sequence of evolutionary events leading to such a reconciliation can be inferred and plotted (Figure 5).
As already mentioned, GS1a is an isoform described to be present not only in all gymnosperms, including Araucariales, Cupressales, Pinales, Gnetales, Ginkgoales and Cycadales, but also in basal angiosperms such as species from the orders Amborellales and Magnoliales.
The data included in the orthGS package cover 34 GS1a proteins from as many species.
This parity between the number of proteins and the number of species is rather misleading, since it might induce us to think that all these proteins are orthologs and that they have diverged exclusively due to speciation processes. However, if we plot the orthology network inferred by tree reconciliation, the picture is quite different: gene duplication within the GS1a lineage is frequent, but unlike what happens with the GS1b lineage, in the case of GS1a duplications seem to be balanced by losses.
It would be interesting to investigate in the future whether this balance between duplications and losses responds to selective forces or, on the contrary, can be explained based on the stochastic nature of these processes.