How to Create a Correlation Matrix & Analyze Gene Expression

This article provides a comprehensive guide on creating a correlation matrix for differential gene expression and analyzing the differences between normal and cancerous breast tissue samples using a dataset from NCBI GEO.

Introduction

Two powerful technologies are used to analyze gene expression: DNA Microarray and RNA sequencing (RNA-seq).


DNA Microarray: Also known as DNA chips, this technology uses pre-designed probes, which are short DNA sequences designed to bind target RNA or DNA through complementary base pairing. When the sample RNA or DNA binds to these probes, the corresponding spots on the microarray light up under the scanner. The intensity of the fluorescence indicates the amount of target RNA or DNA present in the sample.


RNA-seq: This method involves extracting the RNA from the sample, converting it into cDNA, and breaking the cDNA into smaller pieces for sequencing. The sequencing machine reads the cDNA fragments, which are then compared to a reference genome to determine gene activity and expression levels. From these reads, a gene quantification matrix is generated, listing each gene and its expression level in each sample as a numeric variable.


The dataset we will work with was generated using RNA-seq. This technique provides a comprehensive and high-resolution snapshot of the transcriptome, allowing for detailed analysis of gene expression. By analyzing this real-world data, we can gain insights into how gene expression differs between cancerous and normal breast tissue. The following sections will guide you through the steps to analyze RNA-seq data, using the Series GSE183947 dataset from NCBI GEO.

Dataset Overview

In this tutorial, we will use the “interested_genes_data.xlsx” dataset. This dataset was created from the original dataset found on NCBI GEO and shortened for tutorial purposes. For this analysis, we identified 11 genes in the original dataset that are associated with breast tumors (BRCA1 and BRCA2 are listed together as one entry):


1. ERBB2 (HER2): Mutations in this gene affect the PI3K/Akt signaling pathway, which is related to cell growth and survival.


2. BRCA1 and BRCA2: Mutations in these genes are linked with breast and ovarian cancer. Both genes are involved in DNA repair mechanisms.


3. TP53: Often referred to as the “guardian of the genome” as it is responsible for genomic stability. Mutations often lead to loss of function and increased tumor development and progression.


4. ESR1 (Estrogen Receptor 1): This gene encodes the estrogen receptor, important in the development and progression of breast cancer. This gene is the target for hormone therapy in individuals with breast cancer.


5. PIK3CA: Mutations in this gene are common in breast cancer and affect the PI3K/Akt signaling pathway.


6. PTEN: This tumor suppressor gene becomes inactivated in many cancers, including breast cancer.


7. AKT1: Plays a role in the PI3K/Akt signaling pathway and can be involved in breast cancer when mutated.


8. MUC1: Often overexpressed in many breast cancers, it encodes a protein involved in cell signaling and protection.


9. CDH1 (E-Cadherin): Involved in cell adhesion, often lost in more invasive forms of breast cancer.


10. MKI67 (Ki-67): Associated with cell proliferation and is used as a marker to determine the growth rate of tumors.


The dataset contains 572 entries, with columns for genes, samples, FPKM (Fragments Per Kilobase of transcript per Million mapped fragments), title, tissue, and metastasis. FPKM is a normalized measure of gene expression that accounts for gene length and sequencing depth, which makes expression levels more comparable across genes and samples than raw read counts.
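To make the normalization concrete, here is a minimal sketch of the standard FPKM formula. The function name and the example numbers are hypothetical and are not taken from the GSE183947 dataset; the FPKM values in the tutorial file were already computed upstream.

```python
def fpkm(fragments_for_gene, gene_length_bp, total_mapped_fragments):
    """FPKM = fragments * 1e9 / (gene length in bp * total mapped fragments).

    The factor 1e9 combines "per kilobase" (1e3) and "per million" (1e6).
    """
    return fragments_for_gene * 1e9 / (gene_length_bp * total_mapped_fragments)


# Hypothetical numbers, for illustration only: 500 fragments mapped to a
# 2,000 bp gene in a library of 20 million mapped fragments.
example_fpkm = fpkm(500, 2_000, 20_000_000)
print(example_fpkm)  # 12.5
```

Doubling the gene length or the library size halves the FPKM for the same fragment count, which is what "accounts for gene length and sequencing depth" means in practice.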

Step-by-Step Walkthrough

Step 1: Import your dataset into Julius

This walkthrough uses Python. However, you can also perform the analysis in Julius using R by switching the code runtime environment toggle in the top right of the chat interface.


For the first step, we will load the dataset from Google Sheets. In the steps that follow, we will split it into two subsets: one for normal breast tissue and the other for cancerous breast tissue.


Prompt: “Please load the dataset from this Google Sheets link.”

Uploaded preview for genes dataset

Above is a screenshot of the dataset loaded into Julius. As explained before, the dataset contains a ‘gene’ column with the gene names, a sample column with the sample number, an ‘FPKM’ (Fragments Per Kilobase of transcript per Million mapped fragments) column with the normalized expression value, a title column with the replicate, a tissue column, and a metastasis column.
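If you want to follow along outside of Julius, loading the data with pandas might look roughly like the sketch below. We do not know the exact code Julius generates, so this is only an illustration: the column names are assumed from the description above, and the rows are made-up placeholders rather than real records.

```python
import io

import pandas as pd

# Made-up rows that mimic the column layout described above; the sample IDs,
# titles, and FPKM values are placeholders, not real GSE183947 records.
toy_csv = io.StringIO(
    "gene,samples,FPKM,title,tissue,metastasis\n"
    "BRCA1,S1,10.5,tumor rep1,breast tumor,yes\n"
    "BRCA1,S2,4.2,normal rep1,normal breast tissue,no\n"
    "ESR1,S1,55.0,tumor rep1,breast tumor,yes\n"
    "ESR1,S2,12.3,normal rep1,normal breast tissue,no\n"
)

# For the real file, pd.read_excel("interested_genes_data.xlsx") would be the
# analogous call (it needs an Excel engine such as openpyxl installed).
df = pd.read_csv(toy_csv)

print(df.head())
print(df.dtypes)  # FPKM should be numeric; the other columns are text
```

Checking the column types right after loading is a cheap way to catch FPKM values that were read in as text.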

Step 2: Removing the Metastasis Column

The next step is to remove the metastasis column. If you would like to include this column for further analysis after this demonstration, please feel free to do so. For this use case, we will not be using this column, so we will remove it.


Prompt: “Please remove the metastatic column.”

Removing metastasis column using Julius

In the screenshot, you can see that Julius has removed the metastasis column. Our dataset now contains only ‘genes’, ‘samples’, ‘FPKM’, ‘title’, and ‘tissue’.
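In pandas, removing a column is a one-line operation. The sketch below uses a toy frame with assumed column names (the real file may name the column ‘metastasis’ or ‘metastatic’, so check first).

```python
import pandas as pd

# Toy frame with the assumed column names; values are placeholders.
df = pd.DataFrame(
    {
        "gene": ["BRCA1", "BRCA1"],
        "samples": ["S1", "S2"],
        "FPKM": [10.5, 4.2],
        "title": ["tumor rep1", "normal rep1"],
        "tissue": ["breast tumor", "normal breast tissue"],
        "metastasis": ["yes", "no"],
    }
)

# drop() returns a new DataFrame, so the original stays available if the
# metastasis information is needed for a later analysis.
df_no_meta = df.drop(columns=["metastasis"])
print(list(df_no_meta.columns))
```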

Step 3: Split the dataset

The next step is to split the dataset into two subsets: one with normal breast tissue and the other with the cancerous breast tissue.


Prompt: “Looking at the ‘tissue’ column, can you split the dataset by ‘normal breast tissue’ and ‘breast tumor tissue’?”

Prompting Julius to split data set by "normal breast tissue" and "breast tumor"
Breast tumor separated dataset

Above is a screenshot of the output provided by Julius after splitting the dataset.
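A split like this can be sketched in pandas with a boolean filter on the tissue column. The tissue labels below are assumptions based on the prompt and captions above; the filter only matches exact strings, so it is worth printing the unique labels in the real data first.

```python
import pandas as pd

# Toy data; the tissue labels are assumed and the values are placeholders.
df = pd.DataFrame(
    {
        "gene": ["BRCA1", "BRCA1", "ESR1", "ESR1"],
        "samples": ["S1", "S2", "S1", "S2"],
        "FPKM": [10.5, 4.2, 55.0, 12.3],
        "tissue": [
            "breast tumor",
            "normal breast tissue",
            "breast tumor",
            "normal breast tissue",
        ],
    }
)

# Inspect the exact labels before filtering on them.
print(df["tissue"].unique())

normal_df = df[df["tissue"] == "normal breast tissue"].copy()
tumor_df = df[df["tissue"] == "breast tumor"].copy()
print(len(normal_df), len(tumor_df))
```

The `.copy()` calls make each subset an independent table, so later edits to one subset do not trigger pandas warnings about modifying a slice.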

Step 4: Run Descriptive Statistics

Let’s prompt Julius to perform descriptive statistics on this dataset.


Prompt: “Can we perform descriptive statistics on each dataset please on each gene?”

Filtered normal breast tissue dataset
Filtered breast tumor dataset

Examining the screenshots above, Julius has provided a neat output of various descriptive statistics for each gene in the normal and cancerous breast tissues. You can see the count of each gene, the average expression, the standard deviation, minimum and maximum values, and the 25th, 50th (median), and 75th percentiles for each gene. Feel free to examine these statistics and compare the normal values of each gene to the cancerous values. This should give you some insight into how the genes may be expressed differently between the two conditions.
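The per-gene summary described above corresponds closely to what pandas produces with `groupby` and `describe`. Here is a small sketch on made-up values for one subset; the column names are assumed.

```python
import pandas as pd

# Toy subset with made-up FPKM values for two genes.
normal_df = pd.DataFrame(
    {
        "gene": ["BRCA1"] * 4 + ["ESR1"] * 4,
        "FPKM": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
    }
)

# One row per gene: count, mean, std, min, 25%, 50%, 75%, max.
summary = normal_df.groupby("gene")["FPKM"].describe()
print(summary)
```

Running the same call on the tumor subset gives a second table that can be compared gene by gene.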

Step 5: Examining Distribution and Outliers

After examining the descriptive statistics, we can look at the distribution and potential outliers in each gene. Let’s do this by creating histograms of the frequency distribution of expression rates for each gene and box plots to identify outliers.


Prompt: “Now can we get some histograms on the genes and also some box plots to see the outliers?”

prompt histograms and box plots for outliers
Julius generation for histograms

The above screenshot shows the histograms of all the genes in the normal breast tissue subset, with FPKM on the x-axis and frequency on the y-axis. After inspecting each histogram, the expression values do not appear to follow a normal distribution; we test this formally in Step 6.

Julius generation for box plots

The screenshot above shows the box plot visualization created by Julius, highlighting outliers in different genes within the normal breast tissue subset. Only BRCA2, CDH1, and ESR1 appear to contain outliers.

Histograms of gene expression levels in breast tumor generated by Julius

The above screenshot shows the histograms of all the genes in the cancerous breast tissue subset, with FPKM on the x-axis and frequency on the y-axis. As in the normal breast tissue subset, the expression values of these genes do not appear to follow a normal distribution.

Box Plots of Gene Expression Level in Breast Tumor

The figure above shows the box plot created by Julius for the cancerous breast tissue subset, with each gene and its corresponding outliers. From this visualization, we can see that most genes contain some outliers, with MUC1 having the most extreme values.
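For readers who want to reproduce this kind of figure themselves, the following is a minimal matplotlib sketch on synthetic, right-skewed data. It is not the code Julius ran; the gene subset, the random values, and the helper `iqr_outliers` are all illustrative assumptions. The helper applies the same 1.5 × IQR rule that matplotlib's box plot uses by default to decide which points fall beyond the whiskers.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def iqr_outliers(values):
    """Return the values beyond 1.5 * IQR from the quartiles (box plot rule)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return values[mask]


# Synthetic, right-skewed values standing in for one tissue subset.
rng = np.random.default_rng(0)
genes = ["BRCA1", "ESR1", "MUC1"]
subset = pd.DataFrame(
    {"gene": np.repeat(genes, 30), "FPKM": rng.lognormal(2.0, 0.8, size=90)}
)
per_gene = [subset.loc[subset["gene"] == g, "FPKM"].to_numpy() for g in genes]

# One histogram per gene: FPKM on the x-axis, frequency on the y-axis.
fig_hist, axes = plt.subplots(1, len(genes), figsize=(12, 3))
for ax, gene, values in zip(axes, genes, per_gene):
    ax.hist(values, bins=10)
    ax.set_title(gene)
    ax.set_xlabel("FPKM")
    ax.set_ylabel("Frequency")

# One box per gene; points drawn beyond the whiskers are the outliers.
fig_box, ax_box = plt.subplots()
ax_box.boxplot(per_gene)
ax_box.set_xticks(range(1, len(genes) + 1))
ax_box.set_xticklabels(genes)
ax_box.set_ylabel("FPKM")

for gene, values in zip(genes, per_gene):
    print(gene, "outliers:", len(iqr_outliers(values)))
```

Call `fig_hist.savefig(...)` or `plt.show()` to view the figures, depending on your environment.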

Step 6: Performing Shapiro-Wilk’s Test of Normality

After visualizing gene expression values in histograms and looking at the outliers present in each gene in both subsets, we should now test statistically whether the data deviate from a normal distribution. We can do this by running the Shapiro-Wilk test of normality on each gene.


Prompt: “Please run a normality test on both subsets of data and on every gene to identify if they are normally distributed or not.”

Normal breast tissue data normality results
Breast Tumor data normality results


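From the p-values listed in the screenshots above, we can see that the majority of the genes have p-values below 0.05. Because the null hypothesis of the Shapiro-Wilk test is that the data are normally distributed, a p-value below 0.05 indicates a non-normal distribution for these genes. Therefore, we will rely on non-parametric methods for the statistical tests that follow.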
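A per-gene, per-tissue normality check can be sketched with `scipy.stats.shapiro`. The data below are synthetic log-normal values, and the column names and the 0.05 threshold mirror the text above; the actual p-values in the tutorial come from the real dataset, not from this example.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic, skewed values for two genes in two tissue groups.
rng = np.random.default_rng(1)
toy = pd.DataFrame(
    {
        "tissue": ["normal breast tissue"] * 60 + ["breast tumor"] * 60,
        "gene": (["BRCA1"] * 30 + ["ESR1"] * 30) * 2,
        "FPKM": rng.lognormal(2.0, 1.0, size=120),
    }
)

rows = []
for (tissue, gene), group in toy.groupby(["tissue", "gene"]):
    # Null hypothesis: the values come from a normal distribution.
    w_stat, p_value = stats.shapiro(group["FPKM"])
    rows.append(
        {
            "tissue": tissue,
            "gene": gene,
            "W": w_stat,
            "p_value": p_value,
            "rejects_normality": p_value < 0.05,
        }
    )

normality = pd.DataFrame(rows)
print(normality)
```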

Step 7: Creating a Correlation Matrix

Our next step is to calculate Spearman’s rank correlation and create a correlation matrix. We are using Spearman’s correlation rather than Pearson’s correlation because we have determined that our data do not follow a normal distribution.


Prompt 1: “Calculate the Spearman’s rank correlation for each gene and create a correlation matrix please.”

Filtered normal breast tissue dataset
Filtered breast tumor dataset generated

In the above screenshot, we can see Julius has provided the correlation coefficients using Spearman's rank correlation. We can now ask Julius to visualize these correlation coefficients in a correlation matrix.
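To compute gene-by-gene correlations from a long table like ours, the data first need to be reshaped so that each gene is a column and each sample is a row. The sketch below shows one way to do this in pandas, assuming the column names ‘samples’, ‘gene’, and ‘FPKM’; the values are made up so that the result is easy to verify by eye. In the tutorial, this would be done once per tissue subset.

```python
import pandas as pd

# Toy long-format data: BRCA2 rises with BRCA1, ESR1 falls as BRCA1 rises.
long_df = pd.DataFrame(
    {
        "samples": ["S1", "S2", "S3", "S4", "S5"] * 3,
        "gene": ["BRCA1"] * 5 + ["BRCA2"] * 5 + ["ESR1"] * 5,
        "FPKM": [
            1.0, 2.0, 3.0, 4.0, 5.0,    # BRCA1
            2.0, 4.0, 6.0, 8.0, 20.0,   # BRCA2 (same ordering as BRCA1)
            9.0, 7.0, 5.0, 3.0, 1.0,    # ESR1 (reverse ordering)
        ],
    }
)

# Reshape to one row per sample and one column per gene.
wide = long_df.pivot(index="samples", columns="gene", values="FPKM")

# Spearman correlation works on ranks, so the jump to 20.0 does not matter.
corr = wide.corr(method="spearman")
print(corr.round(2))
```

Because Spearman's method only looks at rank order, BRCA1 and BRCA2 correlate at 1.0 here even though their relationship is not linear.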


Prompt 2: “Can you visualize the correlation matrices please?”

Correlation matrixes prompt for Julius
Heatmaps visualizing correlation matrixes for filtered datasets

Examining the correlation matrices shown in the above images, we observe a stark difference in how the expression levels of these genes relate to one another in the two tissue types. In the normal correlation matrix, many genes are positively correlated with one another. For example, in normal breast tissue, BRCA1 and BRCA2 are moderately correlated, with a coefficient of 0.65. However, in the breast tumor correlation matrix, these same genes are essentially uncorrelated, with a coefficient of -0.01.


The side-by-side comparison of these two heatmaps highlights the disruption in gene expression in cancerous cells versus healthy ones. This disruption could be due to multiple factors, such as the molecular changes that occur in cancerous cells, which in turn can affect how the genes interact and function with one another.
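A side-by-side heatmap comparison can be sketched with plain matplotlib as follows. The expression tables here are random toy data, so the resulting patterns carry no biological meaning; the point is the layout: a shared color scale from -1 to 1 so that the two panels are directly comparable.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
genes = ["BRCA1", "BRCA2", "ESR1"]

# Random toy expression tables (rows = samples, columns = genes).
wide_tables = {
    "Normal breast tissue": pd.DataFrame(
        rng.lognormal(2.0, 0.5, size=(20, 3)), columns=genes
    ),
    "Breast tumor": pd.DataFrame(
        rng.lognormal(2.0, 0.5, size=(20, 3)), columns=genes
    ),
}

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (title, wide) in zip(axes, wide_tables.items()):
    corr = wide.corr(method="spearman")
    # Fixing vmin/vmax keeps the color scale identical in both panels.
    image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(genes)))
    ax.set_xticklabels(genes, rotation=45)
    ax.set_yticks(range(len(genes)))
    ax.set_yticklabels(genes)
    ax.set_title(title)
    # Write each coefficient inside its cell.
    for i in range(len(genes)):
        for j in range(len(genes)):
            ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")

fig.colorbar(image, ax=axes, label="Spearman correlation")
```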


Step 8: Examining the changes in gene expression

Our next step is to determine whether the differences in gene expression between normal and cancerous tissue are statistically significant. We can use the Mann-Whitney U test for this, which compares two independent groups based on the ranks of their values and is commonly interpreted as a comparison of medians. Because it does not assume normality, this test is appropriate for data that do not follow a normal distribution.


Prompt: “Please perform the Mann-Whitney U test to compare gene expression levels between normal and cancerous breast tissue in each gene.”

Mann-Whitney U test to compare gene expression levels between normal and cancerous breast tissues

In the screenshot above, Julius lists each gene with its corresponding statistic and p-value. From the analysis, we see that there is a statistically significant difference in expression levels for the ESR1, AKT1, MKI67, and PTEN genes.
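The per-gene comparison can be sketched with `scipy.stats.mannwhitneyu`. The gene names GENE_A and GENE_B and all values below are hypothetical and chosen only to show one clearly separated gene and one overlapping gene; they do not reproduce the results in the screenshot.

```python
import pandas as pd
from scipy import stats

# Hypothetical FPKM values per gene for each tissue group.
normal = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [1.0, 3.0, 5.0, 7.0, 9.0],
}
tumor = {
    "GENE_A": [10.0, 11.0, 12.0, 13.0, 14.0],  # fully separated from normal
    "GENE_B": [2.0, 4.0, 6.0, 8.0, 10.0],      # interleaved with normal
}

rows = []
for gene in normal:
    u_stat, p_value = stats.mannwhitneyu(
        normal[gene], tumor[gene], alternative="two-sided"
    )
    rows.append(
        {
            "gene": gene,
            "U": u_stat,
            "p_value": p_value,
            "significant": p_value < 0.05,
        }
    )

results = pd.DataFrame(rows).set_index("gene")
print(results)
```

The test is run once per gene, with the normal and tumor values for that gene as the two independent samples.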

Step 9: Visualizing Statistical Significance

After determining statistical significance in gene expression between cancerous and normal breast tissues, we can visualize the differences using a box plot to understand how these differences appear.



Prompt: “Please create a box plot with the four genes that are statistically significant from one another and add in an asterisk with brackets highlighting the two tissue types that are statistically significant from one another. Please add * for p<0.05 and ** for p<0.001.”

Julius prompt for revised visualization for box plot
Box Plot with brackets and asterisks showing significant gene expression differences in normal vs cancerous breast tissues

The screenshot above shows the box plot that was generated by Julius, highlighting significant differences in gene expression between normal and cancerous breast tissue. Upon examination of this figure, we can see which genes are downregulated or upregulated in cancerous tissue compared to normal tissue. We could also perform a fold change analysis to quantify the direction and size of the change, but this visualization was sufficient for this analysis.
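As a rough illustration of how such an annotated box plot and a simple fold change could be produced, here is a sketch for a single gene. The helper names, the FPKM values, and the p-value are hypothetical. The asterisk thresholds follow the prompt above, while the "ns" label and the pseudocount of 1 (added to avoid dividing by zero) are our own assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np


def significance_label(p_value):
    """Map a p-value to the asterisk scheme used in the prompt above."""
    if p_value < 0.001:
        return "**"
    if p_value < 0.05:
        return "*"
    return "ns"  # "ns" (not significant) is our own addition


def log2_fold_change(tumor, normal, pseudocount=1.0):
    """log2 ratio of median expression; positive means higher in tumor."""
    ratio = (np.median(tumor) + pseudocount) / (np.median(normal) + pseudocount)
    return float(np.log2(ratio))


def add_bracket(ax, x1, x2, y, label):
    """Draw a bracket from x1 to x2 at height y with a centered label."""
    tick = 0.03 * y
    ax.plot([x1, x1, x2, x2], [y, y + tick, y + tick, y], color="black")
    ax.text((x1 + x2) / 2, y + tick, label, ha="center", va="bottom")


# Hypothetical FPKM values and p-value for a single gene.
normal = [3.0, 2.5, 3.5, 3.0, 2.8]
tumor = [7.0, 6.5, 8.0, 7.0, 7.5]
p_value = 0.008

fig, ax = plt.subplots()
ax.boxplot([normal, tumor])
ax.set_xticks([1, 2])
ax.set_xticklabels(["Normal", "Tumor"])
ax.set_ylabel("FPKM")
add_bracket(ax, 1, 2, max(normal + tumor) * 1.05, significance_label(p_value))

print(log2_fold_change(tumor, normal))  # 1.0 for these toy values
```

A log2 fold change of 1.0 means the median expression (plus pseudocount) is twice as high in the tumor group; negative values would indicate lower expression in tumor.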

Conclusion

In this tutorial, we explored how to effectively analyze RNA-seq data to investigate gene expression differences between cancerous and normal breast tissues. We began with descriptive statistics and visualizations to understand data distributions and identify outliers. The Shapiro-Wilk test indicated that most genes in our dataset did not follow a normal distribution.


Next, we analyzed correlations between genes using Spearman’s rank correlation. This method revealed how genes correlated with each other and highlighted how these correlations differ between cancerous and normal breast tissue. We then assessed the statistical significance of gene expression differences between cancerous and normal breast tissue using the Mann-Whitney U test and visualized the statistically significant differences in a box plot.


This tutorial demonstrates how Julius can help you successfully analyze RNA-seq data and create meaningful visualizations using real-world data.

