In data analysis, datasets are crucial for extracting insights and making informed decisions. However, data often arrives in fragmented form, which requires consolidation through appending and merging techniques to address this challenge.
Appending involves adding rows from one dataset to another, expanding its vertical scope. On the other hand, merging involves combining two datasets using shared keys or identifiers, enhancing the depth of data analysis.
Both appending and merging are essential for creating comprehensive datasets that provide a holistic view of the available information.
Julius AI provides a wide range of capabilities, including seamless integration of appending and merging functionalities for datasets. This capability greatly simplifies the preparatory stages of data analysis or machine learning projects. Appending incorporates rows from one dataset into another, increasing its size and breadth, while merging datasets involves consolidating two or more datasets based on common columns or indices.
In this tutorial, we will be working with three example datasets: Example_Dataset_A, Example_Dataset_B, and Example_Dataset_C. These datasets were created using the data from tutorial 1 to simulate real-world scenarios where datasets may need to be appended or merged.
The example datasets contain a subset of data collected for a study on the association between post-partum mortality (death within the first 6 months) of women with peripartum cardiomyopathy (PPCM) and various factors. These factors include maternal age, maternal haemoglobin levels, weight, heart rate, systolic blood pressure (SBP), and C-reactive protein (CRP) levels. To access the datasets and follow along with this tutorial, please click on the provided links.
You can download the datasets to follow along with this tutorial:
Example_Dataset_A
Example_Dataset_B
Example_Dataset_C
Below, you will find a detailed, step-by-step walkthrough and explanation of how to append and merge datasets in Julius AI. Screenshots have been included for each step to assist you in navigating through the process smoothly.
We began by uploading Example_Dataset_A and Example_Dataset_B to demonstrate the process of appending them using Julius AI. Once the datasets were uploaded, Julius AI provided a preview and explained their contents. Julius AI observed that both datasets had similar columns, with the following variables: patient number (NoPat), mortality status (mortality1), age, age category (Age_cat), hypertension status, height, weight in kg, systolic blood pressure (SBP), diastolic blood pressure (DBP), heart rate (HR), cholesterol level (Cholest), glucose level (Gluc), C-reactive protein level (CRP), sodium level (Na), potassium level (K), and haemoglobin level (HB). Julius also mentioned that Dataset A begins with patient number 1, whereas Dataset B starts from patient number 51, indicating that the datasets might be connected as part of a larger continuous dataset. This is the primary distinction between the datasets.
Prompt: "Can you describe each dataset i.e. the number of participants, the size of the dataset, the number of columns? Are there any similarities or differences?"
Julius AI explained that both Dataset A and Dataset B each consist of 50 entries and have 16 columns. He also mentioned that this means each dataset represents data from 50 participants, making a total of 100 participants when considering both datasets together. The columns in both datasets are identical and cover various health metrics, including mortality status, age, hypertension status, height, weight, blood pressure (both systolic and diastolic), heart rate, cholesterol level, glucose level, C-reactive protein level, sodium level, potassium level, and haemoglobin level.
He highlighted that the mean values for most metrics are quite similar between the two datasets, suggesting that they may either be part of a larger dataset split into two parts or represent similar populations. The primary difference lies in the patient numbers (NoPat), which start from 1 in Dataset A and continue from 51 in Dataset B. This confirms that these datasets are sequential parts of a larger dataset.
Prompt: "Append Dataset A and Dataset B to create a combined dataset for further analysis"
Julius AI successfully merged the two datasets, resulting in a combined dataset that now contains data from both Dataset A and Dataset B. The combined dataset consists of a total of 100 entries and retains the same 16 columns as the individual datasets. Julius AI stated that the dataset is now prepared for further analysis.
Prompt: "Can you preview the appended dataset to show where the dataset was joined and how it was joined?"
Julius AI fulfilled our request by reviewing the dataset and presenting the combined dataset, focusing on the point where Dataset A and Dataset B were combined. He clarified that Rows 45 to 50 contained the final few entries from Dataset A, while rows 51 to 55 represented the initial entries from Dataset B. The transition between the two datasets was seamless, as the patient numbers (NoPat) continued sequentially from 50 in Dataset A to 51 in Dataset B, indicating a successful append operation.
Prompt: "Can you export the appended dataset and name it \"appended_dataset"
Julius AI has successfully renamed the dataset.csv file and included a link for exporting it. You can download the file using the provided link. Remember to also specify the preferred format for downloading the files, such as Excel or CSV.sv formats. Additionally, be aware that we will be using the appended_dataset as a base/primary dataset for our upcoming task of merging datasets.
Prompt: "Preview and describe the attached dataset \"Example_dataset C\""
Julius AI summarized "Example dataset C" which contains 100 entries and has three columns: NoPat (patient number), Gender, and Province. The Gender column has two distinct values, representing males and females, with females making up the majority (56 out of 100). The dataset includes data from 9 different provinces, with the Northern Cape being the province most frequently recorded. The NoPat column ranges from 1 to 100, indicating that the dataset likely represents a sequential list of patients.
Prompt: "Which variables are shared between the Appended_Dataset and Example_Dataset_C?"
Julius AI confirmed that the shared variable between the Appended_Dataset and Example_Dataset_C is called NoPat. This variable indicates that both datasets have a common column, which can be used to merge or compare the datasets using patient identifiers.
Prompt: "Can you merge this Dataset \"Example_Dataset_C\" with \"appended_dataset\". Please order the variables \"Gender\" and \"Province\" after \"NoPat\""
Julius AI has provided an update stating that the merging of the datasets "Example_Dataset_C" and "appended_dataset" was successful. The merge was based on the NoPat column, and as requested, the variables Gender and Province have been rearranged to appear immediately after the NoPat column. The merged dataset now includes all the original columns from both datasets, allowing for a comprehensive view of each patient's information, including their gender, province, and various health metrics.
Julius AI has saved the merged dataset as "merged_dataset.csv" and shared a download link for accessing it. You have the option to choose the format in which you would like to export the dataset and the name you want to give it. In our example, we have decided to rename the dataset as "Example dataset_merged" (shown below).
Prompt: "Please name this dataset \"Example dataset_merged\" and save it as an excel file"
Note: If you need to append or merge multiple datasets, you can follow the same steps. With Julius AI, you can effortlessly merge or append multiple files simultaneously using just one prompt. However, it is recommended that you review your dataset after each step to ensure you are obtaining the desired results.
- Data Integrity: Before merging or appending datasets, make sure they are clean and pre-processed to prevent errors from being introduced (see Data Cleaning and Preparation Tutorial).
- Check key columns: When merging, confirm that the key columns have the same data type and contain matching values as expected.
- Create backups: Before performing any operations, consider creating backups of your original datasets.
- Verify column names: When appending, ensure that the datasets have matching column names and data types to avoid any issues.
- Remember to export your datasets after making any changes. Julius AI can only store your datasets for an hour before deleting them from the server.
Appending and merging datasets are vital tasks in data analysis. These tasks can be easily completed using Julius AI. By following the steps outlined in this tutorial and utilising the example datasets provided, users can efficiently prepare their data for analysis or machine learning tasks. Remember to use the tips provided to ensure a smooth data manipulation experience. This tutorial aims to empower users to handle their datasets confidently and efficiently, maximising the potential of Julius AI's capabilities.