DSE 3 Unit 1
What is Data Science? Data science is the in-depth study of large volumes of
data. It involves extracting meaningful insights from raw, structured, and
unstructured data using the scientific method, various technologies, and
algorithms. Data science is all about:
Asking the correct questions and analyzing the raw data.
Modelling the data using various complex and efficient algorithms.
Visualizing the data to get a better perspective.
Understanding the data to make better decisions and reach a final result.
Need for Data Science: The following are some of the main reasons for using
data science technology:
With the help of data science technology, we can convert massive amounts
of raw and unstructured data into meaningful insights.
Data science technology is being adopted by companies of all sizes, from big
brands to start-ups. Google, Amazon, Netflix, etc., which handle huge
amounts of data, use data science algorithms to improve the customer
experience.
Data science is being used to automate transportation, for example in building
self-driving cars, which are seen as the future of transportation.
Data science can help with predictions of many kinds, such as surveys,
elections, flight ticket confirmation, etc.
Types of Data Science Jobs: If you learn data science, you get the
opportunity to pursue various exciting job roles in this domain. Some of these
roles are given below:
1. Data Analyst
2. Machine learning expert
3. Data engineer
4. Data Scientist
1. Data Analyst: A data analyst is an individual who mines huge amounts
of data, models the data, and looks for patterns, relationships, trends, and so on.
The analyst then produces visualizations and reports that support
decision making and problem solving.
Skill required: To become a data analyst, you need a good background
in mathematics, business intelligence, and data mining, and a basic knowledge
of statistics. You should also be familiar with computer languages and tools
such as MATLAB, Python, SQL, Hive, Pig, Excel, SAS, R, JS, Spark, etc.
2. Machine Learning Expert: A machine learning expert works with the
various machine learning algorithms used in data science, such as regression,
clustering, classification, decision trees, random forests, etc.
3. Data Engineer: A data engineer works with massive amounts of data and is
responsible for building and maintaining the data architecture of a data science
project. A data engineer also creates the data set processes used in
modeling, mining, acquisition, and verification.
Skill required: A data engineer must have in-depth knowledge of SQL, MongoDB,
Cassandra, HBase, Apache Spark, Hive, and MapReduce, along with knowledge of
languages such as Python, C/C++, Java, Perl, etc.
Skill required: To become a data scientist, one should know technical languages
and tools such as R, SAS, SQL, Python, Hive, Pig, Apache Spark, and MATLAB. Data
scientists must also understand statistics, mathematics, and visualization, and
have good communication skills.
The main phases of the data science life cycle are given below:
1. Discovery: The first phase is discovery, which involves asking the right
questions. When a data science project starts, the first step is to determine
the basic requirements, priorities, and project budget. We also need to
determine everything the project requires, such as the number of people,
technology, time, data, and an end goal. The business problem can then be
framed at the initial hypothesis level.
2. Data preparation: Data preparation is also known as Data Munging. In this
phase, we need to perform the following tasks:
Data cleaning
Data reduction
Data integration
Data transformation
After performing all of the above tasks, the data is ready to be used in the
later phases.
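The four data preparation tasks can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the records, the `customers` lookup, and all function names are hypothetical, and real projects usually rely on a dedicated data library. Integration is applied before reduction here only because this sketch drops the `id` column used for joining.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
raw_sales = [
    {"id": 1, "region": " North ", "amount": "120.5"},
    {"id": 2, "region": "south", "amount": None},
    {"id": 2, "region": "south", "amount": None},  # duplicate record
    {"id": 3, "region": "North", "amount": "80"},
]
customers = {1: "Asha", 2: "Ravi", 3: "Meena"}  # hypothetical second source, keyed by id


def clean(records):
    """Data cleaning: drop repeated ids, tidy text, fill missing amounts with the mean."""
    seen, unique = set(), []
    for r in records:
        if r["id"] not in seen:
            seen.add(r["id"])
            unique.append({
                "id": r["id"],
                "region": r["region"].strip().lower(),
                "amount": float(r["amount"]) if r["amount"] is not None else None,
            })
    fill = mean(r["amount"] for r in unique if r["amount"] is not None)
    for r in unique:
        if r["amount"] is None:
            r["amount"] = fill
    return unique


def integrate(records, names):
    """Data integration: attach the customer name from the second source."""
    return [{**r, "customer": names.get(r["id"], "unknown")} for r in records]


def reduce_columns(records, keep):
    """Data reduction: keep only the columns needed for analysis."""
    return [{k: r[k] for k in keep} for r in records]


def transform(records):
    """Data transformation: min-max scale the amount into the range [0, 1]."""
    amounts = [r["amount"] for r in records]
    lo, hi = min(amounts), max(amounts)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all amounts are equal
    return [{**r, "amount_scaled": (r["amount"] - lo) / span} for r in records]


prepared = transform(
    reduce_columns(integrate(clean(raw_sales), customers),
                   ["customer", "region", "amount"])
)
for row in prepared:
    print(row)
```

Filling a missing value with the mean is only one possible cleaning strategy; dropping the record or using the median are equally valid choices depending on the data.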
Benefits of Git: A version control application allows us to keep track of all the
changes that we make to the files of our project. Every time we change the
files of an existing project, we can push those changes to a repository. Other
developers can pull our changes from the repository and continue to
work with the updates that we added to the project files. Some significant
benefits of using Git are as follows:
GitHub: GitHub is a Git repository hosting service. It also offers features
such as access control and collaboration, and it provides a Web-based
graphical interface. GitHub is an American company. It hosts your project's
source code, written in any of a variety of programming languages, and keeps
track of the changes made by programmers. It offers both the distributed version
control and the source code management (SCM) functionality of Git. It also
provides collaboration features such as bug tracking, feature requests, and task
management for every project.
M K Mishra, Asst. Prof. of Comp. Sc., FMAC, Bls. Page 7 of 10
Features of GitHub: GitHub is a place where programmers and designers work
together. They collaborate, contribute, and fix bugs together. It hosts plenty of open
source projects and code in various programming languages. Some of its
significant features are as follows.
Collaboration
Integrated issue and bug tracking
Graphical representation of branches
Git repositories hosting
Project management
Team management
Code hosting
Track and assign tasks
Conversations
Wikis
Benefits of GitHub: The name GitHub can be split into Git and Hub. The GitHub
service includes access controls as well as collaboration features like task
management, repository hosting, and team management. The key benefits of
GitHub are as follows.
It is easy to contribute to open source projects via GitHub.
It helps you create excellent documentation.
You can attract recruiters by showing off your work. If you have a profile on
GitHub, you will have a higher chance of being recruited.
It allows your work to get out there in front of the public.
You can track changes in your code across versions.
Git vs GitHub:
Git is an open-source distributed version control system which is available to
everyone at zero cost. It is designed to handle everything from small to very
large projects with speed and efficiency. It was developed to co-ordinate work
among programmers. Version control allows you to track changes and work together
with your team members in the same workspace.
GitHub, by contrast, is a web-based Git repository hosting service.
GitHub provides all of the distributed version control and source
code management (SCM) functionality of Git. It also adds some features of its
own, all in a single software tool.
To better understand the similarities and differences between Git and GitHub, look
at the following points.
Git Version Control System: A version control system is software that tracks
changes to a file or set of files over time so that you can recall specific versions
later. It also allows you to work together with other programmers.
A version control system is a collection of software tools that help a team
manage changes to source code. It uses a special kind of database to keep track
of every modification to the code.
Developers can compare the current code with earlier versions to find and fix
mistakes.
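The core idea (save every version, recall any of them later, and compare two versions) can be sketched with a toy in-memory store. This is an illustration only, not how Git actually stores data: the `TinyHistory` class and its method names are hypothetical, and only Python's standard `difflib` module is used.

```python
import difflib


class TinyHistory:
    """Toy in-memory version store for one text file (not Git's real storage model)."""

    def __init__(self):
        self._versions = []  # list of (message, text) tuples, oldest first

    def commit(self, text, message):
        """Save a snapshot and return its 0-based version number."""
        self._versions.append((message, text))
        return len(self._versions) - 1

    def checkout(self, version):
        """Recall the exact text stored for a given version."""
        return self._versions[version][1]

    def diff(self, old, new):
        """Return a unified diff between two stored versions as a list of lines."""
        a = self.checkout(old).splitlines()
        b = self.checkout(new).splitlines()
        return list(difflib.unified_diff(a, b, f"v{old}", f"v{new}", lineterm=""))

    def log(self):
        """List (version number, message) pairs, oldest first."""
        return [(i, msg) for i, (msg, _) in enumerate(self._versions)]


history = TinyHistory()
v0 = history.commit("total = price * qty\n", "initial formula")
v1 = history.commit("total = price * qty\ntotal += tax\n", "add tax")

print(history.log())                     # every change, with its message
print("\n".join(history.diff(v0, v1)))   # what changed between two versions
print(history.checkout(v0))              # recall the earlier version
```

A real version control system adds much more on top of this, such as tracking many files at once, recording authors and timestamps, and supporting branches.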
Benefits of the Version Control System: A version control system is
very helpful in software development; developing software without
using version control is unsafe. It provides backups against unforeseen problems.
Version control systems offer developers a fast interface. They also allow
software teams to preserve efficiency and agility as the team scales to include
more developers. Some key benefits of having a version control system are as follows.
Complete change history of the file
Simultaneous work
Branching and merging
Traceability
Types of Version Control System
Localized version control systems
Centralized version control systems
Distributed version control systems