How to Perform Exploratory Data Analysis (EDA) with Julius

How to Append and Merge Datasets Using Julius AI

In data analysis, datasets are crucial for extracting insights and making informed decisions. However, data often arrives in fragmented form, which requires consolidation through appending and merging techniques to address this challenge.

Appending involves adding rows from one dataset to another, expanding its vertical scope. On the other hand, merging involves combining two datasets using shared keys or identifiers, enhancing the depth of data analysis.

Both appending and merging are essential for creating comprehensive datasets that provide a holistic view of the available information.

Julius AI provides a wide range of capabilities, including seamless integration of appending and merging functionalities for datasets. This capability greatly simplifies the preparatory stages of data analysis or machine learning projects. Appending incorporates rows from one dataset into another, increasing its size and breadth, while merging datasets involves consolidating two or more datasets based on common columns or indices.

Example Datasets for this Tutorial

In this tutorial, we will be working with three example datasets: Example_Dataset_A, Example_Dataset_B, and Example_Dataset_C. These datasets were created using the data from tutorial 1 to simulate real-world scenarios where datasets may need to be appended or merged.

The example datasets contain a subset of data collected for a study on the association between post-partum mortality (death within the first 6 months) of women with peripartum cardiomyopathy (PPCM) and various factors. These factors include maternal age, maternal haemoglobin levels, weight, heart rate, systolic blood pressure (SBP), and C-reactive protein (CRP) levels. To access the datasets and follow along with this tutorial, please click on the provided links.

You can download the datasets to follow along with this tutorial:

Example_Dataset_A

Example_Dataset_B

Example_Dataset_C

Step-by-Step Walkthrough

Below, you will find a detailed, step-by-step walkthrough and explanation of how to append and merge datasets in Julius AI. Screenshots have been included for each step to assist you in navigating through the process smoothly.

Appending Datasets

Load and preview the Datasets: Start by loading both datasets into Julius AI. You can do this by uploading the files through the chat interface.

We began by uploading Example_Dataset_A and Example_Dataset_B to demonstrate the process of appending them using Julius AI. Once the datasets were uploaded, Julius AI provided a preview and explained their contents. Julius AI observed that both datasets had similar columns, with the following variables: patient number (NoPat), mortality status (mortality1), age, age category (Age_cat), hypertension status, height, weight in kg, systolic blood pressure (SBP), diastolic blood pressure (DBP), heart rate (HR), cholesterol level (Cholest), glucose level (Gluc), C-reactive protein level (CRP), sodium level (Na), potassium level (K), and haemoglobin level (HB). Julius also mentioned that Dataset A begins with patient number 1, whereas Dataset B starts from patient number 51, indicating that the datasets might be connected as part of a larger continuous dataset. This is the primary distinction between the datasets.

Describe the datasets: In addition to the brief description provided by Julius AI when uploading the datasets, you may want to ask him for further clarification regarding the structure and content of the datasets. This will help you identify any differences and ensure that the datasets have common columns before proceeding with the appending process.

Prompt: "Can you describe each dataset i.e. the number of participants, the size of the dataset, the number of columns? Are there any similarities or differences?"

Julius AI explained that both Dataset A and Dataset B each consist of 50 entries and have 16 columns. He also mentioned that this means each dataset represents data from 50 participants, making a total of 100 participants when considering both datasets together. The columns in both datasets are identical and cover various health metrics, including mortality status, age, hypertension status, height, weight, blood pressure (both systolic and diastolic), heart rate, cholesterol level, glucose level, C-reactive protein level, sodium level, potassium level, and haemoglobin level.

He highlighted that the mean values for most metrics are quite similar between the two datasets, suggesting that they may either be part of a larger dataset split into two parts or represent similar populations. The primary difference lies in the patient numbers (NoPat), which start from 1 in Dataset A and continue from 51 in Dataset B. This confirms that these datasets are sequential parts of a larger dataset.

Append Command: Now we can go ahead and ask Julius AI to append the datasets.

Prompt: "Append Dataset A and Dataset B to create a combined dataset for further analysis"

Julius AI successfully merged the two datasets, resulting in a combined dataset that now contains data from both Dataset A and Dataset B. The combined dataset consists of a total of 100 entries and retains the same 16 columns as the individual datasets. Julius AI stated that the dataset is now prepared for further analysis.

Review the Merged Dataset: Once you have appended the datasets, it is important to preview the combined dataset to ensure that the appending was successful. You can ask Julius AI to help you with this task by previewing the dataset and highlighting where the appending took place. Take the time to review it carefully and make sure it meets your expectations.

Prompt: "Can you preview the appended dataset to show where the dataset was joined and how it was joined?"

Julius AI fulfilled our request by reviewing the dataset and presenting the combined dataset, focusing on the point where Dataset A and Dataset B were combined. He clarified that Rows 45 to 50 contained the final few entries from Dataset A, while rows 51 to 55 represented the initial entries from Dataset B. The transition between the two datasets was seamless, as the patient numbers (NoPat) continued sequentially from 50 in Dataset A to 51 in Dataset B, indicating a successful append operation.

Prompt: "Can you export the appended dataset and name it \"appended_dataset"

Julius AI has successfully renamed the dataset.csv file and included a link for exporting it. You can download the file using the provided link. Remember to also specify the preferred format for downloading the files, such as Excel or CSV.sv formats. Additionally, be aware that we will be using the appended_dataset as a base/primary dataset for our upcoming task of merging datasets.

Merging Datasets

Load the Datasets: Similar to appending, start by loading the datasets you wish to merge. When merging datasets, the dataset that you begin with, or the one that acts as the main reference point for merging additional data, is commonly known as the "base dataset" or the "primary dataset." The dataset that you are adding or merging with the base dataset is often referred to as the "secondary dataset," the "additional dataset," or simply the "dataset to be added." In our example, we will merge the dataset Example_dataset C (additional dataset) to the appended_dataset (base dataset) . Let's start by previewing Example_dataset C.

Prompt: "Preview and describe the attached dataset \"Example_dataset C\""

Julius AI summarized "Example dataset C" which contains 100 entries and has three columns: NoPat (patient number), Gender, and Province. The Gender column has two distinct values, representing males and females, with females making up the majority (56 out of 100). The dataset includes data from 9 different provinces, with the Northern Cape being the province most frequently recorded. The NoPat column ranges from 1 to 100, indicating that the dataset likely represents a sequential list of patients.

Identify the Common Key: Now that we have seen the appearance of both the Appended_Dataset and Example_Dataset_C, our next step is to identify the column(s) that these datasets have in common. This will serve as the foundation for appending them. From the summary provided above, it is evident that both datasets share the variable "NoPat" for participant number. Therefore, this will be the key for merging the datasets. Otherwise, you could also ask Julius AI, to confirm the variables that are shared between the two datasets (See response below).

Prompt: "Which variables are shared between the Appended_Dataset and Example_Dataset_C?"

Julius AI confirmed that the shared variable between the Appended_Dataset and Example_Dataset_C is called NoPat. This variable indicates that both datasets have a common column, which can be used to merge or compare the datasets using patient identifiers.

Merge Command: Now, we can ask Julius AI to merge the two datasets. You can specify the common key variable and how you want to merge (e.g., inner, outer, left, right). However, Julius AI already knows the common column, as indicated above. You may also indicate how you want the new variables to be ordered.

Prompt: "Can you merge this Dataset \"Example_Dataset_C\" with \"appended_dataset\". Please order the variables \"Gender\" and \"Province\" after \"NoPat\""

Julius AI has provided an update stating that the merging of the datasets "Example_Dataset_C" and "appended_dataset" was successful. The merge was based on the NoPat column, and as requested, the variables Gender and Province have been rearranged to appear immediately after the NoPat column. The merged dataset now includes all the original columns from both datasets, allowing for a comprehensive view of each patient's information, including their gender, province, and various health metrics.

Review the Appended Dataset: It is important that before you export the dataset, you preview it to ensure that everything is as you expected. Julius AI will display the new dataset. Check to ensure all columns have been added correctly.

Julius AI has saved the merged dataset as "merged_dataset.csv" and shared a download link for accessing it. You have the option to choose the format in which you would like to export the dataset and the name you want to give it. In our example, we have decided to rename the dataset as "Example dataset_merged" (shown below).

Prompt: "Please name this dataset \"Example dataset_merged\" and save it as an excel file"

Note: If you need to append or merge multiple datasets, you can follow the same steps. With Julius AI, you can effortlessly merge or append multiple files simultaneously using just one prompt. However, it is recommended that you review your dataset after each step to ensure you are obtaining the desired results.

Tips for Effective merging or appending of datasets.

- Data Integrity: Before merging or appending datasets, make sure they are clean and pre-processed to prevent errors from being introduced (see Data Cleaning and Preparation Tutorial).

- Check key columns: When merging, confirm that the key columns have the same data type and contain matching values as expected.

- Create backups: Before performing any operations, consider creating backups of your original datasets.

- Verify column names: When appending, ensure that the datasets have matching column names and data types to avoid any issues.

- Remember to export your datasets after making any changes. Julius AI can only store your datasets for an hour before deleting them from the server.

Conclusion

Appending and merging datasets are vital tasks in data analysis. These tasks can be easily completed using Julius AI. By following the steps outlined in this tutorial and utilising the example datasets provided, users can efficiently prepare their data for analysis or machine learning tasks. Remember to use the tips provided to ensure a smooth data manipulation experience. This tutorial aims to empower users to handle their datasets confidently and efficiently, maximising the potential of Julius AI's capabilities.