This article provides a comprehensive guide on creating a correlation matrix for differential gene expression and analyzing the differences between normal and cancerous breast tissue samples using a dataset from NCBI GEO.
Two powerful technologies are used to analyze gene expression: DNA Microarray and RNA sequencing (RNA-seq).
DNA Microarray: Also known as DNA chips, this technology uses pre-designed probes, which are short DNA sequences that bind to target RNA or DNA because they carry the complementary strand sequence. When the sample RNA or DNA binds to these probes, the corresponding spots on the microarray light up under the scanner. The intensity of the fluorescence indicates the amount of target RNA or DNA present in the sample.
RNA-seq: This method involves extracting the RNA from the sample, converting it into cDNA, and breaking the cDNA into smaller pieces for sequencing. The sequencing machine reads all cDNA sequences, which are then compared to a reference genome to determine gene activity and expression levels. From these readings, a gene quantification matrix is generated, listing each gene and its expression level in each sample as a numeric value.
The dataset we will work with was generated using RNA-seq. This technique provides a comprehensive and high-resolution snapshot of the transcriptome, allowing for detailed analysis of gene expression. By analyzing this real-world data, we can gain insights into how gene expression differs between cancerous and normal breast tissue. The following sections will guide you through the steps to analyze RNA-seq data, using the Series GSE183947 dataset from NCBI GEO.
In this tutorial, we will use the “interested_genes_data.xlsx” dataset. This dataset was created from the original dataset found on NCBI GEO, but modifications were made to shorten it for tutorial purposes. For this analysis, we identified 11 genes in the original dataset that are associated with breast tumors:
1. ERBB2 (HER2): Mutations in this gene affect the PI3K/Akt signaling pathway; related to growth and survival.
2. BRCA1 and BRCA2: Mutations in these genes are linked with breast and ovarian cancer. They are involved in DNA repair mechanisms.
3. TP53: Often referred to as the “guardian of the genome” as it is responsible for genomic stability. Mutations often lead to loss of function and increased tumor development and progression.
4. ESR1 (Estrogen Receptor 1): This gene encodes the estrogen receptor, important in the development and progression of breast cancer. This gene is the target for hormone therapy in individuals with breast cancer.
5. PIK3CA: Mutations in this gene are common in breast cancer and affect the PI3K/Akt signaling pathway.
6. PTEN: This tumor suppressor gene becomes inactivated in many cancers, including breast cancer.
7. AKT1: Plays a role in the PI3K/Akt signaling pathway and can be involved in breast cancer when mutated.
8. MUC1: Often overexpressed in many breast cancers, it encodes a protein involved in cell signaling and protection.
9. CDH1 (E-Cadherin): Involved in cell adhesion, often lost in more invasive forms of breast cancer.
10. MKI67 (Ki-67): Associated with cell proliferation and is used as a marker to determine the growth rate of tumors.
The dataset contains 572 entries, with columns for genes, samples, FPKM (fragments per kilobase of transcript per million mapped fragments), title, tissue and metastasis. FPKM is a normalized measure of gene expression that accounts for gene length and sequencing depth, which makes expression levels easier to compare across genes and samples.
This walkthrough uses Python. However, you can also perform the analysis in Julius using R by switching the code runtime environment toggle in the top right of the chat interface.
For the first step, we will load the dataset from Google Sheets. We will then split the dataset into two subsets: one for normal breast tissue and the other for cancerous breast tissue.
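To make the normalization concrete, here is a minimal sketch of how an FPKM value is computed from raw counts. The function name and the numbers in the example are illustrative assumptions, not values from the GSE183947 dataset.

```python
def fpkm(fragments: float, gene_length_bp: float, total_mapped_fragments: float) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    # Dividing by length in kilobases and depth in millions is the same as
    # multiplying the raw count by 1e9 and dividing by (length in bp * depth).
    return fragments * 1e9 / (gene_length_bp * total_mapped_fragments)


# Hypothetical gene: 500 fragments, 2,000 bp long, 20 million mapped fragments.
example = fpkm(500, 2_000, 20_000_000)
print(example)  # 12.5
```

In words: 500 fragments over 2 kilobases is 250 fragments per kilobase, and dividing by 20 million mapped fragments (in millions) gives 12.5.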
Prompt: “Please load the dataset from this Google Sheets link.”
Above is a screenshot of the dataset loaded into Julius. As explained before, the dataset contains a ‘gene’ column with the names of the genes, a column with the sample number, an ‘FPKM’ column with the normalized expression value, a title column with the replicate, a tissue column, and a metastasis column.
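If you want to reproduce the loading step outside Julius, one possible approach is to read the sheet as CSV with pandas. This is only a sketch: it assumes the sheet is shared so that anyone with the link can view it, the helper names are made up for illustration, and SHEET_ID is a placeholder rather than the tutorial's actual link.

```python
import re

import pandas as pd


def sheet_csv_url(share_link: str, gid: int = 0) -> str:
    """Turn a Google Sheets share link into a CSV export URL."""
    match = re.search(r"/spreadsheets/d/([A-Za-z0-9_-]+)", share_link)
    if match is None:
        raise ValueError("not a Google Sheets link")
    return (
        "https://docs.google.com/spreadsheets/d/"
        f"{match.group(1)}/export?format=csv&gid={gid}"
    )


def load_expression(share_link: str) -> pd.DataFrame:
    """Download the first tab of the sheet as CSV and return a DataFrame."""
    return pd.read_csv(sheet_csv_url(share_link))


# Placeholder link: SHEET_ID is hypothetical, substitute your own sheet.
url = sheet_csv_url("https://docs.google.com/spreadsheets/d/SHEET_ID/edit?usp=sharing")
print(url)
```

Calling load_expression with a real share link would then return the table shown in the screenshot, provided the sheet is publicly readable.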
The next step is to remove the metastasis column. If you would like to include this column for further analysis after this demonstration, please feel free to do so. For this use case, we will not be using this column, so we will remove it.
Prompt: “Please remove the metastatic column.”
In the screenshot, you can see that Julius has removed the metastatic column. Our dataset now only contains ‘genes’, ‘samples’, ‘FPKM’, ‘title’, and ‘tissue’.
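In pandas, removing a column is a single call. The sketch below uses a two-row toy table with made-up values, and the column names (including "metastasis") are assumptions based on the description above; check them against your own file.

```python
import pandas as pd

# Toy rows standing in for the real data (values are made up).
df = pd.DataFrame(
    {
        "gene": ["BRCA1", "BRCA1"],
        "samples": ["S1", "S2"],
        "FPKM": [1.5, 2.5],
        "title": ["normal.1", "tumor.1"],
        "tissue": ["normal breast tissue", "breast tumor tissue"],
        "metastasis": ["no", "yes"],
    }
)

# errors="ignore" keeps the call safe if the column was already removed.
df = df.drop(columns=["metastasis"], errors="ignore")
print(list(df.columns))
```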
The next step is to split the dataset into two subsets: one with normal breast tissue and the other with the cancerous breast tissue.
Prompt: “Looking at the ‘tissue’ column, can you split the dataset by ‘normal breast tissue’ and ‘breast tumor tissue’?”
Above is a screenshot of the output provided by Julius after splitting the dataset.
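The split can be expressed as two boolean filters on the tissue column. This sketch uses a four-row toy table; the tissue labels are taken from the prompt above, while the column names and FPKM values are assumptions for illustration.

```python
import pandas as pd

# Toy rows standing in for the real data (values are made up).
df = pd.DataFrame(
    {
        "gene": ["TP53", "TP53", "PTEN", "PTEN"],
        "samples": ["S1", "S2", "S1", "S2"],
        "FPKM": [10.0, 14.0, 7.0, 3.0],
        "tissue": [
            "normal breast tissue",
            "breast tumor tissue",
            "normal breast tissue",
            "breast tumor tissue",
        ],
    }
)

# Keep the rows for each tissue type in its own DataFrame.
normal = df[df["tissue"] == "normal breast tissue"].reset_index(drop=True)
tumor = df[df["tissue"] == "breast tumor tissue"].reset_index(drop=True)
print(len(normal), len(tumor))
```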
Let’s prompt Julius to perform descriptive statistics on this dataset.
Prompt: “Can we perform descriptive statistics on each dataset please on each gene?”
Examining the screenshots above, Julius has provided a neat output of various descriptive statistics for each gene in the normal and cancerous breast tissues. You can see the count for each gene, the mean expression, the standard deviation, the minimum and maximum values, and the 25th, 50th and 75th percentiles (the quartiles). Feel free to examine these statistics and compare the normal values of each gene to the cancerous values. This should give you some insight into how the genes may be expressed differently between the two conditions.
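A per-gene summary like this can be sketched in pandas with groupby and describe. The table below is toy data with made-up values, and the column names are assumptions.

```python
import pandas as pd

# Toy subset standing in for the normal-tissue data (values are made up).
normal = pd.DataFrame(
    {
        "gene": ["BRCA1"] * 4 + ["TP53"] * 4,
        "FPKM": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
    }
)

# One row per gene: count, mean, std, min, quartiles, and max.
summary = normal.groupby("gene")["FPKM"].describe()
print(summary)
```

Running the same call on the tumor subset gives a second table that can be compared gene by gene.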
After examining the descriptive statistics, we can look at the distribution and potential outliers in each gene. Let’s do this by creating histograms of the frequency distribution of expression rates for each gene and box plots to identify outliers.
Prompt: “Now can we get some histograms on the genes and also some box plots to see the outliers?”
The above screenshot shows the histograms of all the genes in the normal breast tissue subset, with FPKM on the x-axis and frequency on the y-axis. After visualizing each histogram, we can see that the expression values do not appear to be normally distributed.
The screenshot above shows the box plot created by Julius, highlighting outliers in different genes within the normal breast tissue subset. Only BRCA2, CDH1 and ESR1 seem to contain outliers.
The above screenshot shows the histogram of all the genes in the cancerous breast tissue subset, with FPKM on the x-axis and frequency distribution on the y-axis. Like the normal breast tissue histograms, these genes also do not follow a normal distribution.
The screenshot above shows the box plot created by Julius for the cancerous breast tissue subset, with each gene and any outliers it may have. From this visualization, we can see that most genes contain some outliers, with MUC1 having the most extreme values.
After visualizing gene expression values in histograms and looking at the outliers present in each of the genes in both subsets, we should now statistically confirm that our dataset does not follow a normal distribution. We can do this by running the Shapiro-Wilk test of normality on each of the genes.
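Box plots conventionally draw a point as an outlier when it lies more than 1.5 times the interquartile range beyond the quartiles. The code Julius ran is not shown here, so the following is only a sketch of that common rule, applied to toy data with made-up values and assumed column names.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values[(values < q1 - k * iqr) | (values > q3 + k * iqr)]


# Toy data: MUC1 has one extreme value, PTEN has none (values are made up).
df = pd.DataFrame(
    {
        "gene": ["MUC1"] * 6 + ["PTEN"] * 6,
        "FPKM": [1.0, 2.0, 3.0, 4.0, 5.0, 50.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    }
)

# Number of flagged outliers per gene.
counts = df.groupby("gene")["FPKM"].apply(lambda s: len(iqr_outliers(s)))
print(counts)
```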
Prompt: “Please run a normality test on both subsets of data and on every gene to identify if they are normally distributed or not.”
From the p-values listed above in the screenshots, we can see that a majority of the genes exhibit a non-normal distribution, as their p-values fall below 0.05. We will therefore use non-parametric methods for the remaining statistical tests.
Our next step is to create a correlation matrix after calculating Spearman’s rank correlation. We use Spearman’s correlation rather than Pearson’s because it is rank-based and does not assume normally distributed data, which suits the non-normal distributions we found.
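A per-gene Shapiro-Wilk test can be sketched with scipy.stats.shapiro. The helper name, the column names, and the deliberately skewed toy values below are assumptions for illustration; the 0.05 threshold matches the one used above.

```python
import pandas as pd
from scipy import stats


def shapiro_by_gene(df: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    """Run the Shapiro-Wilk test on the FPKM values of each gene."""
    rows = []
    for gene, group in df.groupby("gene"):
        w, p = stats.shapiro(group["FPKM"])
        # p < alpha means we reject the hypothesis of normality.
        rows.append({"gene": gene, "W": w, "p_value": p, "looks_normal": p >= alpha})
    return pd.DataFrame(rows).set_index("gene")


# Toy gene with strongly right-skewed values (1, 2, 4, ..., 2048).
toy = pd.DataFrame({"gene": ["SKEWED"] * 12, "FPKM": [2.0**k for k in range(12)]})
result = shapiro_by_gene(toy)
print(result)
```

The function would be run once on the normal subset and once on the tumor subset.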
Prompt 1: “Calculate the Spearman’s rank correlation for each gene and create a correlation matrix please.”
In the above screenshot, we can see Julius has provided the correlation coefficients using Spearman's rank correlation. We can now ask Julius to visualize these correlation coefficients in a correlation matrix.
Prompt 2: “Can you visualize the correlation matrices please?”
Examining the correlation matrices shown in the above images, we observe a stark difference in how pairs of genes are co-expressed in the two tissue types. In the normal correlation matrix, many genes are positively correlated with one another. For example, in normal breast tissue, BRCA2 and BRCA1 are moderately correlated, with a coefficient of 0.65. In the breast tumor correlation matrix, however, the coefficient for the same pair is -0.01, which indicates essentially no monotonic relationship.
The side-by-side comparison of these two heatmaps highlights the disruption in gene expression in cancerous cells versus healthy ones. This disruption could be due to multiple factors, such as the molecular changes that occur in cancerous cells, which in turn can affect how the genes interact and function with one another.
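The correlation step can be sketched in pandas by reshaping the long table so that each gene becomes a column and then calling corr with method="spearman". The helper names, column names, and toy values are assumptions; plot_heatmap is an optional matplotlib sketch that is defined but not called here.

```python
import pandas as pd


def spearman_matrix(long_df: pd.DataFrame) -> pd.DataFrame:
    """Pivot to one column per gene, then correlate the columns by rank."""
    wide = long_df.pivot(index="samples", columns="gene", values="FPKM")
    return wide.corr(method="spearman")


def plot_heatmap(corr: pd.DataFrame, title: str):
    """Draw the matrix as a heatmap (requires matplotlib)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=90)
    ax.set_yticks(range(len(corr.index)))
    ax.set_yticklabels(corr.index)
    ax.set_title(title)
    fig.colorbar(image, ax=ax, label="Spearman's rho")
    return fig


# Toy genes: B rises with A, C falls as A rises (values are made up).
toy = pd.DataFrame(
    {
        "samples": ["S1", "S2", "S3", "S4", "S5"] * 3,
        "gene": ["A"] * 5 + ["B"] * 5 + ["C"] * 5,
        "FPKM": [1, 2, 3, 4, 5, 2, 4, 8, 16, 32, 5, 4, 3, 2, 1],
    }
)
corr = spearman_matrix(toy)
print(corr)
```

Because Spearman's correlation works on ranks, A and B are perfectly correlated even though B grows exponentially. Computing one matrix per subset and passing each to plot_heatmap would give a side-by-side comparison like the one described above.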
Our next step is to determine whether the differences in gene expression between normal and cancerous tissue are statistically significant. We can use the Mann-Whitney U test for this, which compares two independent groups using the ranks of their values rather than their means; it is often described as a comparison of medians. This test is appropriate for data that do not follow a normal distribution.
Prompt: “Please perform the Mann-Whitney U test to compare gene expression levels between normal and cancerous breast tissue in each gene.”
In the screenshot above, Julius lists each gene with its corresponding statistic and p-value. From the analysis, we see that there is a statistically significant difference in expression levels for the ESR1, AKT1, MKI67, and PTEN genes.
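A per-gene Mann-Whitney U test can be sketched with scipy.stats.mannwhitneyu. The helper name, column names, and toy values are assumptions for illustration; the two toy genes are built so that one differs clearly between groups and the other does not.

```python
import pandas as pd
from scipy.stats import mannwhitneyu


def mann_whitney_by_gene(normal: pd.DataFrame, tumor: pd.DataFrame) -> pd.DataFrame:
    """Two-sided Mann-Whitney U test for each gene present in both subsets."""
    rows = []
    for gene in sorted(set(normal["gene"]) & set(tumor["gene"])):
        x = normal.loc[normal["gene"] == gene, "FPKM"]
        y = tumor.loc[tumor["gene"] == gene, "FPKM"]
        u, p = mannwhitneyu(x, y, alternative="two-sided")
        rows.append({"gene": gene, "U": u, "p_value": p})
    return pd.DataFrame(rows).set_index("gene")


# Toy data (values are made up): SHIFTED is higher in tumor, SIMILAR overlaps.
normal = pd.DataFrame(
    {
        "gene": ["SHIFTED"] * 8 + ["SIMILAR"] * 8,
        "FPKM": [1, 2, 3, 4, 5, 6, 7, 8, 1, 3, 5, 7, 9, 11, 13, 15],
    }
)
tumor = pd.DataFrame(
    {
        "gene": ["SHIFTED"] * 8 + ["SIMILAR"] * 8,
        "FPKM": [11, 12, 13, 14, 15, 16, 17, 18, 2, 4, 6, 8, 10, 12, 14, 16],
    }
)
result = mann_whitney_by_gene(normal, tumor)
print(result)
```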
After determining statistical significance in gene expression between cancerous and normal breast tissues, we can visualize the differences using a box plot to understand how these differences appear.
Prompt: “Please create a box plot with the four genes that are statistically significant from one another and add in an asterisk with brackets highlighting the two tissue types that are statistically significant from one another. Please add * for p<0.05 and ** for p<0.001.”
The screenshot above shows the box plot that was generated by Julius, highlighting significant differences in gene expression between normal and cancerous breast tissue. Upon examination of this figure, we can see which genes are downregulated or upregulated in cancerous tissue compared to normal tissue. We could also perform a fold change analysis to quantify the direction and size of the change, but this visualization was sufficient for this analysis.
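For readers who do want the fold change, a common convention is the log2 ratio of the tumor mean to the normal mean, with a small pseudocount to avoid dividing by zero. The sketch below is an assumption about how one might do it, not the method used in this tutorial; the star thresholds mirror the ones requested in the prompt above.

```python
import math


def log2_fold_change(tumor_mean: float, normal_mean: float, pseudocount: float = 1.0) -> float:
    """Positive values mean higher expression in tumor, negative mean lower."""
    return math.log2((tumor_mean + pseudocount) / (normal_mean + pseudocount))


def significance_stars(p_value: float) -> str:
    """Map a p-value to the annotation used on the box plot."""
    if p_value < 0.001:
        return "**"
    if p_value < 0.05:
        return "*"
    return "ns"


# Hypothetical mean FPKM values, not taken from the dataset.
up = log2_fold_change(tumor_mean=15.0, normal_mean=3.0)    # log2(16 / 4)
down = log2_fold_change(tumor_mean=3.0, normal_mean=15.0)  # log2(4 / 16)
print(up, down, significance_stars(0.0004), significance_stars(0.03))
```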
In this tutorial, we explored how to effectively analyze RNA-seq data to investigate gene expression differences between cancerous and normal breast tissues. We began with descriptive statistics and visualizations to understand data distributions and identify outliers. The Shapiro-Wilk test confirmed that our dataset did not follow a normal distribution.
Next, we analyzed correlations between genes using Spearman’s rank correlation. This method revealed how genes correlated with each other and highlighted how these correlations differ between cancerous and normal breast tissue. We then assessed the statistical significance of gene expression changes between cancerous and normal breast tissue using the Mann-Whitney U test and visualized the statistically significant differences via a box plot.
This tutorial demonstrates how Julius can help you successfully analyze RNA-seq data and create meaningful visualizations using real-world data.