How to become a data scientist

by Javier Blanco
| 25 May, 2021

(Or develop yourself if you’re already one).

 

I often get asked questions like:

  • How did you get into Data Science?
  • How do I transition from Industrial Engineering into Data Science? and,
  • What skills do I need to learn to be a good data scientist?

So I decided to write this blog to share my thoughts on what it takes to start a career in this area.

 

 

Similar articles worth checking:

https://blog.usejournal.com/post-3-ds-skillsets-d44babc1aadc
https://hub.packtpub.com/data-science-venn-diagram/

 

 

What is a Data Scientist?

  • Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician. – Josh Wills
  • Data scientists apply sophisticated quantitative and computer science skills to both structure and analyze massive stores or continuous streams of unstructured data, with the intent to derive insights and prescribe action. – Burtch Works Data Science Salary Survey, May 2018
  • More than anything, what data scientists do is make discoveries while swimming in data… In a competitive landscape where challenges keep changing and data never stop flowing, data scientists help decision makers shift from ad hoc analysis to an ongoing conversation with data. – Data Scientist: The Sexiest Job of the 21st Century, Harvard Business Review

 

How to become a Data Scientist?

The ideal Data Scientist has these three aspects well covered:

  • Maths
  • MLOps
  • Domain Knowledge

All marinated with some useful soft skills: communication, creativity, business acumen, etc.

 

Data science Venn diagram

 

 

Data Scientists Need Maths

You need sound knowledge of how different algorithms work and how to put them into practice.

 

As a beginner:

DataCamp: A great place to start from scratch without worrying about downloading libraries, setting up IDEs, importing data, etc. Ideal for someone starting from zero. https://www.datacamp.com/

Kaggle: Lots of datasets, discussions, courses and even competitions available! https://www.kaggle.com/

StatQuest: The best Machine Learning YouTube channel out there. It follows a zero-to-hero philosophy, so it is not just useful for newbies!

 

As someone with experience:

More practice with different types of problems will get you further, of course, but if you want to go ahead and look into the next big things in Data Science I suggest:

Causality: Or how to create models that don’t just use correlations between variables but look for actual causal relationships. A great way to start understanding why this is important is this free course by HarvardX. After that, you are ready to dive into Judea Pearl’s great masterpiece “The Book of Why”.
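To see why correlation alone can mislead, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption: the counts are made up, the scenario (a retention offer sent mostly to new customers) is hypothetical, and the adjustment is only valid if the customer segment is the only confounder.

```python
# Hypothetical counts: (segment, got_offer) -> (customers, churned).
# The offer was mostly sent to new customers, who churn more anyway.
counts = {
    ("new", True): (200, 80),
    ("new", False): (50, 25),
    ("loyal", True): (50, 5),
    ("loyal", False): (200, 30),
}


def pooled_churn_rate(groups):
    """Churn rate over several (customers, churned) groups pooled together."""
    customers = sum(n for n, _ in groups)
    churned = sum(c for _, c in groups)
    return churned / customers


def naive_effect(counts):
    """Difference in churn rate between offer and no offer, ignoring segments."""
    treated = [v for (_, offer), v in counts.items() if offer]
    control = [v for (_, offer), v in counts.items() if not offer]
    return pooled_churn_rate(treated) - pooled_churn_rate(control)


def adjusted_effect(counts):
    """Average the within-segment differences, weighted by segment size."""
    total = sum(n for n, _ in counts.values())
    effect = 0.0
    for segment in sorted({seg for seg, _ in counts}):
        n_t, c_t = counts[(segment, True)]
        n_c, c_c = counts[(segment, False)]
        effect += (n_t + n_c) / total * (c_t / n_t - c_c / n_c)
    return effect


print(f"naive effect of the offer on churn:    {naive_effect(counts):+.3f}")
print(f"adjusted effect of the offer on churn: {adjusted_effect(counts):+.3f}")
```

With these made-up numbers, the naive comparison suggests the offer raises churn by 12 percentage points, while within each segment the offer actually lowers it (by 7.5 points on average). A purely correlational model would learn the first story.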

Online learning: Or how to create models that update (retrain) themselves over the air (in production) with the latest data. A good way to introduce yourself to online learning is this post by Stanford’s ML lecturer Chip Huyen. A good Python library to try online learning is river.
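To make the idea of online learning concrete, here is a minimal sketch in plain Python: a linear model that is updated one observation at a time with stochastic gradient descent. The class name, the learning rate and the simulated stream are all illustrative assumptions. The method names follow the learn_one / predict_one convention that river uses, but this is not river's implementation, just the same learn-as-you-go pattern.

```python
class OnlineLinearRegression:
    """Linear model trained one observation at a time with SGD (illustrative)."""

    def __init__(self, learning_rate=0.05):
        self.learning_rate = learning_rate
        self.weights = {}  # feature name -> weight, created lazily
        self.bias = 0.0

    def predict_one(self, x):
        return self.bias + sum(self.weights.get(k, 0.0) * v for k, v in x.items())

    def learn_one(self, x, y):
        # One gradient step on the squared error of this single observation.
        error = self.predict_one(x) - y
        for k, v in x.items():
            self.weights[k] = self.weights.get(k, 0.0) - self.learning_rate * error * v
        self.bias -= self.learning_rate * error


def simulated_stream(n):
    """Hypothetical data stream following y = 2 * x + 1, with x cycling over 0.0..0.9."""
    for i in range(n):
        x = (i % 10) / 10
        yield {"x": x}, 2 * x + 1


model = OnlineLinearRegression()
errors = []
for features, target in simulated_stream(2000):
    # Predict first, then learn: each point is scored before the model sees it.
    errors.append(abs(model.predict_one(features) - target))
    model.learn_one(features, target)

early_error = sum(errors[:50]) / 50
late_error = sum(errors[-50:]) / 50
print(f"mean abs error, first 50: {early_error:.3f}, last 50: {late_error:.3f}")
```

The model never sees a training set as a whole; it keeps improving as data arrives, which is the property that makes this approach attractive in production.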

 

MLOps

The term MLOps (short for ML Operations) is gaining traction to refer to all the software engineering work organisations are doing to put models into production, or to ingest the data to train your models with.

MLOps really covers all the infrastructure, tooling and software that data scientists need to be more productive in a professional environment. Broadly it can include:

  • Git for version control
  • CI/CD tools and methods
  • Kubernetes and Docker technologies
  • Pipeline tools
  • Monitoring tools
  • APIs
  • Serving tools

MLOps is not covered much in courses, which focus more on the theoretical and practical topics around data preparation and modelling, so you’re going to have to learn it yourself.

There are two ways to go about this:

  1. Just learn MLOps.
    Duh! It’s not so easy though. Junior Data Scientists working in a good ML team will learn what they need on the job, and the best thing you can do is ask lots of questions.

     

    For aspiring Data Scientists or those in less organised teams, there are limited resources out there and most require previous knowledge. Here you can find some good MLOps resources. This is another very comprehensive course (limited to Google Cloud Platform though).

  2. Use platforms that do MLOps for you.
    There are a number of commercial data platforms which include MLOps features. You can often sign up for a free trial and do some of their tutorials, which will help you learn some of the MLOps subjects. Some of the platforms with a free trial or free tier include:

     

    1. DataBricks
      DataBricks is a very popular Machine Learning platform built by the original creators of Apache Spark. Spark is a large-scale data processing engine designed to extract large amounts of data from disk and serve it to large compute clusters in parallel.

       

      With a database at the core, DataBricks is best suited to batch processing of historic data. Having said that, the highly scalable Spark engine does mean that data can be processed relatively quickly if enough storage and compute resources are deployed to the project. Spark also has a streaming library which supports near-real-time processing by batching data into chunks that span from seconds to tens of seconds.

      DataBricks provides Data Scientists with everything they need. You can easily get historic data from a database, or connect the database to a stream source like Kafka. There are notebooks for experimentation and clusters for training or model serving. There are also nice integrations with a managed MLflow (an open source project created by teams at DataBricks) which lets you manage your experiments, models and deployments.

      I would check out their 10-minute tutorials to get started with their 14-day free trial (this doesn’t include storage or compute resources).

    2. Dataiku
      Dataiku is a platform for Data Scientists and Data Analysts. Just like DataBricks it is built around historic data and allows you to connect to just about any storage technology.

       

      Once connected to your data you can use the platform to explore your data, prepare it, develop, train and deploy ML models and consume the results in applications. Dataiku includes many MLOps features like reusable code blocks that can be re-configured into data pipelines. Like the other options here, the user is abstracted from the underlying infrastructure so they don’t have to worry much about provisioning databases or compute resources (you still have to do it but it’s easy).

      Dataiku really wins on the user experience, taking simplicity and usability to the next level and providing probably the easiest-to-use web portal here. The platform supports developers working in SQL, Python, R and Spark. Dataiku does not have extensive support for streaming data in Kafka or time-series data types, but it does have very good APIs for building applications from the results of your data processing.

      I would start with their online tutorial and then check out their free 14-day trial.

    3. Quix.ai
      Quix is a platform focused on real-time data streaming applications. It was built by McLaren F1 engineers who previously developed the systems now used by most F1 teams to stream and process data live during races.

       

      Quix is unique because it is architected around the Kafka message broker rather than a database. Kafka streams messages between applications and serves recent data largely from memory instead of requiring a database query for every read. This has two benefits: first, it results in very low latency in your application; second, it can be more cost-effective because there are fewer I/O operations on the disk.

      Quix allows you to create Kafka topics to which you’ll stream data. It also provides a code environment with integrated Git and a simple way to serve your model to a serverless compute environment. You combine topics and projects to build pipelines by connecting components in a daisy-chain; this is also useful for re-usability.

      Under the hood they are using Docker and Kubernetes, but the Data Scientist is abstracted from this, which is a great example of good MLOps practice. There are also comprehensive APIs which will give you experience producing data and consuming your model results in an application or dashboard.

      Quix is a real-time platform, meaning that you can build an application with a round-trip latency in the tens of milliseconds, including processing time. It is therefore a good place to experiment with online learning.

      I would suggest you use their free tier (which includes 200 free credits per month) to try the crypto alerts tutorial. Here, you will build an application that reads real-time currency exchange data from CoinAPI for your chosen cryptocurrency and sends an SMS to your phone via Twilio’s API following certain logic (e.g. the price reaches a threshold X). This is quite complex in terms of MLOps, but I bet you will get it done in less than an hour, which is quite amazing!
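The alert logic described for the crypto tutorial (send a message when the price reaches a threshold) can be sketched in plain Python as a small daisy-chain of components. This is not the Quix tutorial's code and it calls neither CoinAPI nor Twilio: the price ticks, the "BTC" symbol, the threshold and all function names are hypothetical stand-ins, and the "SMS" is just appended to a list.

```python
def price_stream():
    """Hypothetical price ticks; a real app would read these from a streaming topic."""
    for price in [98.0, 99.5, 101.2, 102.0, 99.0, 100.5]:
        yield {"symbol": "BTC", "price": price}


def threshold_crossings(ticks, threshold):
    """Yield a tick only when the price moves from below the threshold to at or above it."""
    was_above = False
    for tick in ticks:
        is_above = tick["price"] >= threshold
        if is_above and not was_above:
            yield tick
        was_above = is_above


def send_alert(tick, threshold, outbox):
    """Stand-in for an SMS call: append the message to a list instead of sending it."""
    outbox.append(f"{tick['symbol']} reached {tick['price']} (threshold {threshold})")


THRESHOLD = 100.0
sent_messages = []
# Source -> filter -> sink, connected one after the other.
for crossing in threshold_crossings(price_stream(), THRESHOLD):
    send_alert(crossing, THRESHOLD, sent_messages)

print(sent_messages)
```

Alerting only on the crossing, rather than on every tick above the threshold, is a design choice that avoids flooding your phone while the price stays high.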

 

A note of caution when reviewing job descriptions.

Be wary of any role asking the Data Scientist to build and maintain any of the MLOps components previously listed.

As mentioned, MLOps is predominantly a software engineering activity tasked with enabling Data Scientists to focus on the data and the code. You need to be aware of the components and know how to use them (depending on your seniority); you may even need to input into the design and build of MLOps systems, but you should not be required to actually build them yourself.

This is a warning sign of a company and/or team that either don’t know what they’re doing yet, or aren’t investing enough resources in their ML effort – either way it’s bound to fail.

 

Domain knowledge

This is basically knowing your stuff. If you work for a telecoms company to predict churn, you will need to learn everything about how telecoms users behave and what triggers them to leave. If you work for an insurance company that calculates a client’s risk, again, you will need to spend lots of time learning about that.

Knowing about the problem you are modelling makes a huge difference, often a bigger one than knowing the latest algorithm, as you will be able to create better feature variables, choose a more suitable algorithm, etc. The importance of domain knowledge is sometimes underestimated, and it is one of the key requirements for becoming a great Data Scientist.

However, there’s not much you can do in advance unless you have a clear industry in mind to work for. If you do, Kaggle is a great place to try specific industry problems. You could also check out Reddit community groups where you’ll find like-minded people with whom you can build domain-specific solutions.
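Coming back to the point about feature variables, here is a small sketch of what domain knowledge can look like in code, using the telecoms churn example. The field names, the sample customer and the choice of features are all hypothetical assumptions about what a telecoms expert might suggest; they are not taken from any real dataset.

```python
def churn_features(customer):
    """Turn raw monthly records into domain-informed features (illustrative only).

    Assumes more than three months of history, listed oldest month first.
    """
    usage = customer["monthly_minutes"]
    recent, earlier = usage[-3:], usage[:-3]
    recent_avg = sum(recent) / len(recent)
    earlier_avg = sum(earlier) / len(earlier)
    return {
        # Assumption: a drop in usage may say more than the absolute level.
        "usage_change_ratio": recent_avg / earlier_avg,
        # Assumption: customers close to the end of their contract are freer to leave.
        "months_to_contract_end": customer["contract_length_months"] - customer["months_active"],
        # Assumption: recent complaints matter more than old ones.
        "complaints_last_3_months": sum(customer["monthly_complaints"][-3:]),
    }


# Hypothetical customer record.
customer = {
    "monthly_minutes": [300, 310, 290, 200, 150, 100],
    "monthly_complaints": [0, 0, 0, 1, 0, 2],
    "contract_length_months": 12,
    "months_active": 11,
}

print(churn_features(customer))
```

None of these features requires a sophisticated algorithm, but each one encodes something you would only think of after learning how customers in that industry behave.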

 

Conclusion

Anyhow, no one knows everything.

To become a great Data Scientist you need to focus on applying your Maths skills to the Domain. Nobody is born with domain expertise so the best thing you can do is choose an industry that interests you, then find a great team where you can learn continually about all the aspects covered in this blog.

Even among Data Scientists, some people are focusing on MLOps, largely because the industry is really just being invented. Now, some people differentiate between Type A Data Scientists (A for Analyst) and Type B Data Scientists (B for Builder) to refer to profiles that are stronger on Maths or MLOps respectively.

Whatever the type, the best organisations will support their Data Scientists with other specialist people, infrastructure and software so they can focus on applying their Maths skills to the domain.

Working in these organisations means you will often work in multi-disciplinary teams, which is enriching for you and your company. So don’t worry if you feel stronger in one area or another: find a great team!


Javier Blanco Cordero is Senior Data Scientist at Quix, where he helps customers to get the most out of their data science projects. He was previously a Senior Data Scientist at Orange, developing churn prediction, marketing mix modeling, propensity to purchase models and more. Javier is a masters lecturer and speaker, specializing in pragmatic data science and causality.
