Jmis 26 4 167
Big data have revolutionized the way data are processed and used across all fields. In the past, research was primarily conducted with a focus on hypothesis confirmation using sample data. However, in the era of big data, this has shifted to gaining insights from the collected data. Visualizing vast amounts of data to derive insights is crucial. For instance, leveraging big data for visualization can help identify and predict characteristics and patterns related to various infectious diseases. When data are presented in a visual format, patterns within the data become clear, making it easier to comprehend and provide deeper insights. This study aimed to comprehensively discuss data visualization and the various techniques used in the process. It also sought to enable researchers to directly use Python programs for data visualization. By providing practical visualization exercises on GitHub, this study aimed to facilitate their application in research endeavors.

Keywords: Big data, Data visualization, Matplotlib, Seaborn, Python

Received August 21, 2023
Revised October 17, 2023
Accepted November 10, 2023

Corresponding author
Il-Youp Kwak
Department of Applied Statistics, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, Korea
E-mail: [email protected]
https://orcid.org/0000-0002-7117-7669

© 2023 The Korean Society of Endo-Laparoscopic & Robotic Surgery
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Effective communication
Visualization is a powerful tool for conveying data to others. It allows complex concepts and results to be presented in a visual and intuitive manner, enabling smooth communication. Data visualization promotes communication and collaboration among experts from various fields, facilitating informed decision-making based on information.
In summary, data visualization is an essential tool for obtaining meaningful insights from data, ultimately leading to better outcomes.

In this study, we aimed to explain how to implement data visualization using Python’s Matplotlib and Seaborn libraries. Practical code and data can be downloaded from GitHub for learning purposes (https://github.com/soyul5458/Python_data_visualization). The practical exercises were conducted using Google Colab, which is free and can be accessed anytime, anywhere with a Gmail account, without the need to download any separate program.

Matplotlib
168 https://doi.org/10.7602/jmis.2023.26.4.167
Mastering data visualization with Python: practical tips for researchers
import matplotlib.pyplot as plt
x = [1, 2, 3]
y = [21, 25, 33]
plt.plot(x, y)
# Setting the graph title.
plt.title("Line graph");
Fig. 2. Setting the graph title.
# Adding a legend.
plt.plot(x, y, label="Mean temperature (\u2103)")
plt.legend(loc="lower right");
Line style options: "-" solid, "--" dashed, "-." dash-dot, ":" dotted.
Fig. 7. Setting the marker.
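The line style codes listed above can be tried together with a marker in one figure. The following is only a minimal sketch, not the authors' exercise code: it reuses the x and y values from the earlier example, and the vertical offset added to each line is a hypothetical device so that the four styles do not overlap.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3]
y = [21, 25, 33]

# Style strings from the list above, mapped to their names.
styles = {"-": "solid", "--": "dashed", "-.": "dash-dot", ":": "dotted"}

fig, ax = plt.subplots()
lines = []
for offset, (style, name) in enumerate(styles.items()):
    # Hypothetical offset: shift each line upward so all four stay visible.
    shifted = [value + 2 * offset for value in y]
    (line,) = ax.plot(x, shifted, linestyle=style, marker="o", label=name)
    lines.append(line)

ax.set_title("Line styles")
ax.legend(loc="lower right")
```

In a notebook such as Google Colab, the figure is displayed when the cell finishes; elsewhere, plt.show() can be called at the end.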
Han and Kwak
Line graphs are frequently used to represent trends over time.
# Generating data
year = [2010, 2012, 2014, 2016, 2018, 2020, 2022]
n_std = [40, 60, 55, 75, 62, 46, 80]
# Drawing bar plot
plt.bar(year, n_std);
Fig. 10. Vertical bar graph representation with Matplotlib.
# Step 2. Extracting Data by Major
sub1 = df1.loc[df1["Major"]=='A']
sub2 = df1.loc[df1["Major"]=='B']
sub3 = df1.loc[df1["Major"]=='C']
# Load data
iris = sns.load_dataset("iris")
iris.head()
# Drawing a single box plot
plt.figure(figsize=(10, 5))
plt.boxplot(iris['sepal_length']);
(2) Drawing multiple box plots (Fig. 14)
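One possible way to draw several box plots on the same axes is to pass a list of sequences to boxplot(). The sketch below is an illustration under stated assumptions rather than the code used for Fig. 14: the values are made up and merely stand in for numeric columns such as those of the iris data.

```python
import matplotlib.pyplot as plt

# Made-up measurements standing in for three numeric columns
# (hypothetical values, not the real iris data).
groups = {
    "sepal_length": [5.1, 4.9, 6.3, 5.8, 7.1, 6.5],
    "sepal_width": [3.5, 3.0, 3.3, 2.7, 3.0, 3.0],
    "petal_length": [1.4, 1.4, 6.0, 5.1, 5.9, 5.8],
}

fig, ax = plt.subplots(figsize=(10, 5))
# A list of sequences produces one box per sequence.
result = ax.boxplot(list(groups.values()))
# Boxes are placed at positions 1, 2, 3, ... by default; label them by name.
ax.set_xticks(range(1, len(groups) + 1))
ax.set_xticklabels(list(groups.keys()))
ax.set_ylabel("Value")
```

With a DataFrame, the same call could receive a list of its columns instead of the made-up lists.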
tables. The horizontal axis represents intervals, and the vertical axis represents frequencies. The hist() function can be used to create histograms that divide data values into equal intervals called bins, and the size of bins affects the frequency and shape of the histogram.
# Loading 'titanic' dataset from seaborn
import seaborn as sns
titanic = sns.load_dataset("titanic")
# Histogram
plt.hist(titanic.age);
can easily be drawn using the scatter() function. The markers can be changed to different shapes using the marker argument described above.
# Data Preparation
import seaborn as sns
tips = sns.load_dataset("tips")
# Scatter plot
plt.scatter(tips.total_bill, tips.tip)
plt.xlabel('Total bill')
plt.ylabel('Tip');
Fig. 17. Visualization of scatter plot with Matplotlib.
6) Pie chart
Pie charts are used to visually display the overall proportions of categorical data. They provide a convenient way to see the size and relative proportions of each section. The autopct argument in the pie() function specifies the format of the numbers displayed inside the sectors. The value %.2f displays numbers up to two decimal places. The percentage symbol can be included by inputting %.2f%% consecutively. A legend can also be added using the legend() function.
(1) Data preparation (Fig. 18)
import seaborn as sns
titanic = sns.load_dataset("titanic")
total = titanic["sex"].value_counts()
total
Fig. 18. Verification of data quantity.
(2) Drawing pie chart (Fig. 19)
plt.pie(total, labels=total.index, autopct="%.2f%%");
Fig. 20. Displaying percentages in a pie chart.

Seaborn

Figure-level vs. axes-level function
Seaborn functions can be broadly categorized into ‘figure-level’ and ‘axes-level’ functions. The three large boxes at the top in Fig. 21 (relplot, displot, catplot) are figure-level functions, and the smaller boxes below are axes-level functions. Figure-level functions create a Seaborn figure separately from Matplotlib and perform plotting on that figure. Therefore, the layout can be changed using FacetGrid (Seaborn’s figure). Axes-level functions, on the other hand, specify where to plot using the ax parameter, and thus the layout can be changed using methods such as plt.figure(). Table 2 summarizes the various graphs available in the Seaborn library.
Fig. 21. Seaborn library structure.

Data type determines visualization
Table 3 summarizes the statistical analysis and visualization methods based on variable types. Understanding variable types is crucial during the visualization process because different graphic methods are used depending on the type. Visualization guides based on variables are well-documented on the Python gallery page, so please refer to it (https://www.data-to-viz.com/).

- sizes: Specifies the minimum and maximum size of the markers.
(1) hue (Fig. 22)
# Loading the 'tips' Sample Data
tips = sns.load_dataset("tips")
# Using hue allows representing different colors for each group.
sns.relplot(x="total_bill", y="tip", data=tips, hue="smoker");
Fig. 22. Utilization of the hue option with Seaborn.
(4) sizes (Fig. 25)
# Specifying the range for marker size
# In such cases, normalize the data within that range before plotting.
sns.relplot(x="total_bill", y="tip", data=tips, size="size", sizes=(15, 200));
Fig. 25. Utilization of the sizes option with Seaborn.

2) Distribution plot: kernel density plot (Table 2, No. 4)
Seaborn provides additional features compared to simple histograms in Matplotlib, such as kernel density, rug display, and multidimensional composite distribution. Among them, the kernel density plot displays a smoother distribution curve than a histogram by overlapping kernels. Both codes below are for outputting kernel density plots.
· sns.displot(kind="kde")
· sns.kdeplot(x, y, data, bw_adjust, cumulative)
- bw_adjust: Scales the smoothing bandwidth used for density estimation (default = 1); larger values give a smoother curve.
- cumulative: If True, estimates the cumulative distribution function.
For kernel density plots, by assigning variable values to the y axis instead of the x axis, the graph can also be drawn horizontally or overlapped. Additionally, multiple variables can be overlaid on one graph. Please refer to the GitHub practice code for details.
(1) Kernel density plot (Fig. 26)
# Drawing horizontally
sns.kdeplot(y="tip", data=tips);
Fig. 27. Visualization of horizontal density plot with Seaborn.
(3) Estimate with different colors by category (Fig. 28)
# Estimate with different colors based on time.
sns.kdeplot(x="tip", data=tips, hue="time");
Fig. 28. Displaying estimations in different colors by categorical variables.

3) Distribution plot: rug plot (Table 2, No. 6; Fig. 29)
A rug plot is used to describe the distribution of data by showing data positions as small vertical lines (rugs) on the x-axis.
# Rug plot
sns.kdeplot(x="tip", data=tips)
sns.rugplot(x="tip", data=tips);
Fig. 29. Visualization of rug plot with Seaborn.

4) Categorical scatter plot: strip plot (Table 2, No. 8; Fig. 30)
A strip plot is a graph that represents all data points as dots, similar to a scatter plot.
# Strip plot
sns.stripplot(x="day", y="tip", data=tips);
# Swarm plot
sns.swarmplot(x="day", y="tip", data=tips, s=3);
Fig. 31. Visualization of swarm plot with Seaborn.
Fig. 32. Visualization of box plot with Seaborn.

7) Categorical distribution plot: violin plot (Fig. 33)
A violin plot is a visualization of data distribution, resembling a box plot combined with a kernel density plot.
# Violin plot
sns.violinplot(x="day", y="tip", data=tips);
Fig. 33. Visualization of violin plot with Seaborn.

CONCLUSIONS
In conclusion, data visualization is essential. It presents complex concepts in an easy-to-understand manner, allowing for the identification of patterns and trends, gaining insights, and making better decisions more quickly. In the field of clinical research, large amounts of data are being collected, and Python visualization tools are effective for visual representation. Using appropriate data visualization tools according to data types can significantly improve the quality and impact of research, enabling readers to understand complex concepts easily.

Funding/support
None.

Data availability
The data presented in this study are available at: https://github.com/soyul5458/Python_data_visualization.

ORCID
Soyul Han, https://orcid.org/0000-0003-0156-250X
Il-Youp Kwak, https://orcid.org/0000-0002-7117-7669