Data Science Overview
Why Should We Care About Data Science?
What is Data Science?
Data Science is at the intersection between...
Domain Knowledge is knowledge about the specific area being investigated. It enables you to interpret the data correctly and derive insight. Examples of domains include:
- Pharmaceuticals or medical testing, for medical test data
- Retail or sales, for point-of-sale data
- Manufacturing, for production process defect data
Business Knowledge can be either general (about how businesses operate) or specific to the terminology and processes that are unique to the business where the data science is being carried out.
Computer science is the study of mathematical algorithms and processes that interact with data and that can be represented as data in the form of programs. It enables the use of algorithms to manipulate, store, and communicate digital information.
In Data Science, computers are used to:
- Access data (often using SQL to query a database)
- Clean / Prepare the data
- Run / calculate the statistical models (usually in Python or R, using a range of code libraries to accelerate the work)
- Explore the data
- Visualise the data
- Communicate the insights
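As a rough illustration of the first of these uses, the sketch below queries a small database with SQL from Python. It assumes an in-memory SQLite database and a hypothetical `sales` table with made-up point-of-sale rows; the table, column names, and the `revenue_by_store` helper are illustrative only and do not come from any real system.

```python
import sqlite3

# Hypothetical point-of-sale rows: (store, product, units, unit_price)
ROWS = [
    ("north", "tea", 3, 2.50),
    ("north", "coffee", 1, 3.00),
    ("south", "tea", 2, 2.50),
    ("south", "coffee", 4, 3.00),
]


def revenue_by_store(rows):
    """Load rows into an in-memory database and total the revenue per store."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE sales "
            "(store TEXT, product TEXT, units INTEGER, unit_price REAL)"
        )
        # Parameterised inserts keep the data separate from the SQL text.
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT store, SUM(units * unit_price) AS revenue "
            "FROM sales GROUP BY store ORDER BY store"
        )
        return dict(cursor.fetchall())
    finally:
        conn.close()


totals = revenue_by_store(ROWS)
for store, revenue in totals.items():
    print(f"{store}: {revenue:.2f}")
```

In practice the query would run against an existing database rather than rows created in the script, but the pattern of querying with SQL and then summarising in Python is the same.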
Mathematics is the study of numbers, shapes and patterns. The word comes from the Greek “μάθημα” (máthēma), meaning “science, knowledge, or learning”. Maths enables data-driven decision making, which is at the core of data science.
Statistics is a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of data. It is the practice or science of collecting and analysing numerical data in large quantities, especially for the purpose of inferring proportions in a whole population from those in a representative sample.
Statistics enables an understanding of probability and therefore the use of statistical models.
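The idea of inferring a population proportion from a representative sample can be sketched in a few lines. The example below uses the normal-approximation (Wald) interval, which is one common choice among several, and a made-up population of 10,000 customers in which 30% hold a loyalty card. The function name and the data are assumptions for illustration.

```python
import math
import random


def proportion_confidence_interval(successes, n, z=1.96):
    """Normal-approximation (Wald) interval for a population proportion.

    z=1.96 corresponds to an approximate 95% confidence level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    # Clamp to [0, 1] because a proportion cannot fall outside that range.
    return p_hat, max(0.0, p_hat - margin), min(1.0, p_hat + margin)


# Hypothetical population: 1 = holds a loyalty card, 0 = does not.
rng = random.Random(42)
population = [1] * 3000 + [0] * 7000
sample = rng.sample(population, 500)  # simple random sample without replacement

p_hat, low, high = proportion_confidence_interval(sum(sample), len(sample))
print(f"estimate={p_hat:.3f}, approx. 95% interval=({low:.3f}, {high:.3f})")
```

The Wald interval is easy to compute but can behave poorly for small samples or proportions near 0 or 1, where other interval methods are usually preferred.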
What Makes A Good Data Scientist?
Given that data science sits at the intersection of Maths, Computer Science, and Domain/Business Knowledge, it follows that a 'good' data scientist must have skills in each of these areas. The Venn Diagram illustrates how these skills overlap. An individual who has only some of the skills may be well suited to a valuable role within the business, but may not be prepared to be a data scientist.
Business Knowledge / Domain Knowledge is important because…
- It provides the context in which the data science activity occurs
- It enables the correct preparation and interpretation of the data
- It informs the subsequent questions that guide the data exploration and analysis
- It can help to get to the question that really matters to the business
- It focuses effort on insights that the business is able to act upon, and on the levers the business is able to pull to influence the outcome
Programming Skills are important because…
- The majority of data science is carried out in either the Python or the R programming language, with SQL commonly used to query databases.
- There is a wide range of Python and R code libraries that can be leveraged by those familiar with the respective language.
- There are a variety of tools that can enable non-programmers to carry out some data science activities, but they do not offer the same level of flexibility as being familiar with Python or R.
Knowledge of Statistics is important because…
- It informs the selection of the appropriate model to use
- It deepens the understanding of what the models and algorithms are doing
- It guides unbiased sampling techniques
- It enables the identification of statistically significant results
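To make "statistically significant" a little more concrete, here is a minimal sketch of a two-sided z-test for the difference between two proportions. The conversion counts are invented, the function name is hypothetical, and the z-test is only one of many possible tests; which test is appropriate depends on the data and the question.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two proportions.

    Returns the z statistic and the p-value under the null hypothesis
    that both groups share the same underlying proportion.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("test is undefined when the pooled proportion is 0 or 1")
    z = (p_a - p_b) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: conversions out of visitors for two page designs.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z={z:.2f}, p-value={p_value:.4f}")
```

A small p-value suggests the observed difference would be unlikely if the two designs truly converted at the same rate. It does not say how large or how commercially important the difference is.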
Communication skills are important because…
- Data scientists must listen and help the business project sponsor to get to the question they are really asking
- Data science projects usually require conversing with different stakeholders to gather data and context, and to communicate the results
- Other business colleagues often do not have the same familiarity with data science concepts and methods, and need them to be communicated clearly
- The insights and their consequences must be communicated clearly in order for action to be taken
What Method do Data Scientists Apply?
The Steps in the Data Science Process
1. Question Design
2. Data Collection
3. Data Preparation
4. Data Exploration
5. Model the Data
6. Identify Insights
7. Communicate
Step 1: Question Design
- Identify the question that you are trying to answer.
- What question matters to the business?
- Will you be able to obtain data for this question?
- Will you be able to act upon the answer?
- What is the hypothesis that you are going to try to prove or disprove?
Step 2: Data Collection
- Identify the data that you need to answer the question
- Identify what data exists / is actually available
- Should you start collecting additional data to answer the question?
- Gather the data that you need in order to answer the question
- Consider data sampling methods to ensure you have a representative sample
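One sampling method that helps keep a sample representative is stratified sampling, where the same fraction is drawn from each group. The sketch below assumes a hypothetical list of customer records with a `segment` field; the `stratified_sample` helper and the data are illustrative, not a prescribed method.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction from every stratum so group shares are preserved."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    # Group the records by stratum.
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    sample = []
    for group in sorted(strata):
        members = strata[group]
        # Take at least one record so small groups are not lost entirely.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical customer records: 80% "retail", 20% "trade".
customers = [
    {"id": i, "segment": "retail" if i % 5 else "trade"} for i in range(1000)
]
sample = stratified_sample(customers, key=lambda c: c["segment"], fraction=0.1)
trade_share = sum(c["segment"] == "trade" for c in sample) / len(sample)
print(f"sample size={len(sample)}, trade share={trade_share:.2f}")
```

A simple random sample would usually land close to the same shares, but stratifying guarantees them, which matters most when some groups are small.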
Step 3: Data Preparation
- Before you can explore / analyse the data effectively, you must ensure that your data is in a suitable format
- Data cleaning is an essential part of the data science process to ensure that the data is interpreted correctly
- Specific models and tools often require data to be in a specific format in order to function
- Do not underestimate the time required to prepare and clean the data: this is often quoted as taking around 80% of a data scientist’s time.
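A small sketch of what cleaning can involve is shown below: normalising text, converting types, and dropping incomplete or duplicate rows. The field names (`order_id`, `product`, `units`) and the rules themselves are assumptions chosen for illustration; real cleaning rules depend on the data set and the domain.

```python
def clean_records(raw_rows):
    """Normalise text, convert types, and drop incomplete or duplicate rows."""
    cleaned = []
    seen_ids = set()
    for row in raw_rows:
        order_id = (row.get("order_id") or "").strip()
        product = (row.get("product") or "").strip().lower()
        units_text = (row.get("units") or "").strip()

        if not order_id or not product or not units_text:
            continue  # skip rows with missing values
        try:
            units = int(units_text)
        except ValueError:
            continue  # skip rows where units is not a whole number
        if order_id in seen_ids:
            continue  # skip repeated orders
        seen_ids.add(order_id)
        cleaned.append({"order_id": order_id, "product": product, "units": units})
    return cleaned


# Hypothetical raw export with typical problems.
raw = [
    {"order_id": "1", "product": " Tea ", "units": "3"},
    {"order_id": "2", "product": "coffee", "units": ""},     # missing units
    {"order_id": "1", "product": " Tea ", "units": "3"},     # duplicate order
    {"order_id": "3", "product": "COFFEE", "units": "two"},  # unparseable units
    {"order_id": "4", "product": "Coffee", "units": " 2"},
]
cleaned = clean_records(raw)
print(cleaned)
```

Silently dropping rows is a design choice made here for brevity. In a real project it is usually worth counting and reviewing what was dropped, since the rejected rows can themselves reveal data quality problems.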
Step 4: Data Exploration
- Data exploration is aimed at creating a clear mental model and understanding of the data in the mind of the analyst
- It also reveals the defining characteristics of the data set that can be used in further analysis
- The cleaning process usually provides an initial understanding of the data
- Exploring the data usually involves visualising the data in a number of different ways
- Basic relationships between data sets can be identified before the modelling begins
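Before any plotting, a first pass at exploration often means computing summary statistics and checking simple relationships between variables. The sketch below does this with the standard library only; the `spend` and `units` figures are invented, and the helper names are hypothetical.

```python
import math
import statistics


def summarise(values):
    """Return the basic descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: weekly advertising spend and units sold.
spend = [10, 20, 30, 40, 50]
units = [12, 25, 29, 41, 55]

print(summarise(units))
print(f"correlation={pearson_correlation(spend, units):.3f}")
```

A high correlation is a prompt for further investigation rather than a conclusion: it measures only linear association and says nothing about cause.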
Step 5: Model the Data
A central part of the data science process is to model the data:
- Select the appropriate model
- Build the model
- Fit the model
- Validate the model
- Iterate to improve the model
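The build, fit, and validate steps above can be sketched with the simplest possible model: a straight line fitted by ordinary least squares and checked on held-out data. The synthetic data (roughly y = 2x + 1 with noise), the 80/20 split, and the function names are all assumptions for illustration; real projects would normally use a library such as scikit-learn or statsmodels and a model suited to the question.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Synthetic data for illustration: y is roughly 2x + 1 plus random noise.
rng = random.Random(0)
data = [(x, 2 * x + 1 + rng.gauss(0, 1)) for x in range(100)]

# Hold back 20% of the rows so the model is validated on unseen data.
rng.shuffle(data)
train, test = data[:80], data[80:]
train_x, train_y = zip(*train)
test_x, test_y = zip(*test)

slope, intercept = fit_line(train_x, train_y)
score = r_squared(test_x, test_y, slope, intercept)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, held-out R^2={score:.3f}")
```

Validating on data the model has not seen is what guards against a model that merely memorises its training set. Iterating then means changing the model or its inputs and repeating the same check.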
Step 6: Identify Insights
The insights, and the actions they enable, are the reason for the data science process:
- Identify the most important insights
- Understand the limitations of the conclusions from the data, model, and method
- Provide context and caveats so that the insights are understood correctly
- Identify areas for further investigation
Step 7: Communicate
Effectively communicating the insights and their consequences is the step that determines whether they are merely interesting, or impactful.
- Be clear on what the key take-aways are from your message
- Consider who your audience is, and what matters to them
- Be clear on why it matters to them
- What is the action that you suggest they take?
- What is the cost of not taking the action?