Beginner's Guide to Machine Learning Processes
Clarifying the process of machine learning model-building
For readers who are unsure what machine learning (ML) is, here is a simple explanation: machine learning is the process of "training" a computer on a set of data so that it "learns" to predict the value of one particular column (the target) in that dataset from the other columns. This article explains what machine learning is and describes the machine learning workflow from start to finish. It gives beginner data scientists and engineers an overview of what model building looks like, and shows how familiarity with the required steps can shape their understanding of this aspect of data science.
This is an introductory article, intended for new data scientists and engineers who already know the basics of Python programming and Jupyter Notebook. If you are not familiar with Python, or you do not know how to use the code and markdown cells in a Jupyter notebook, get acquainted with both first so that you can follow this article more easily.
The process of machine learning model building is divided into three stages which are listed below:
~ The data preparation stage
~ The model-building stage
~ The results communication stage
These stages in the machine learning process are all documented in a Jupyter Notebook. The Jupyter Notebook platform makes it possible to combine code and markdown cells so that machine learning projects can be comprehensively documented with code, text, and visuals in one file. Describing the flow of work from one stage to the next is the main subject of this article.
Data Preparation Stage: This stage covers the operations you perform on your dataset to get it into a proper state for modeling. It consists of three steps, listed below:
~ Import
~ Explore
~ Split
The import step within the data preparation stage is where you load and clean your dataset, removing discrepancies such as inconsistent numbers of decimal places and inconsistent formatting, splitting crowded columns that pack several values into one, handling null values, etc.
The "explore" step within the data preparation stage is where you analyze the various columns of your dataset and their correlation to one another. It is where you decide which columns to keep for use in the project and which to delete. Columns are deleted when they are not relevant to the prediction task, because leaving them in can degrade the model or introduce errors later in the machine learning process.
The split step is the last step in the data preparation stage, and it is where you divide your data before you build your model. There are two types of split: the vertical and the horizontal split. The vertical split is the division along columns, so a particular column becomes the "target" for predictions, while the other columns become the "features" on which the ML model is trained. The horizontal split, commonly called the "train-test split", divides the rows of the dataset into a training segment and a test segment, for example by index position.
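The three data preparation steps can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a definitive implementation: the toy house-price data, the column names, and the helper functions (`import_data`, `explore`, `vertical_split`, `horizontal_split`) are all assumptions made up for this example. In a real project, libraries such as pandas and scikit-learn would typically handle these steps.

```python
import csv
import io

# Hypothetical toy dataset (made up for illustration): one row has a missing area.
RAW_CSV = """id,area,rooms,price
1,50,2,150
2,65,3,200
3,,2,120
4,80,3,240
5,95,4,290
6,110,4,330
"""

def import_data(text):
    """Import step: read the rows, drop rows with missing values, convert to numbers."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        if any(value == "" for value in record.values()):
            continue  # skip rows that contain a null value
        rows.append({key: float(value) for key, value in record.items()})
    return rows

def explore(rows, columns):
    """Explore step: summarize each column as (min, max, mean)."""
    summary = {}
    for col in columns:
        values = [row[col] for row in rows]
        summary[col] = (min(values), max(values), sum(values) / len(values))
    return summary

def vertical_split(rows, target, features):
    """Vertical split: separate the target column from the feature columns."""
    X = [[row[f] for f in features] for row in rows]
    y = [row[target] for row in rows]
    return X, y

def horizontal_split(X, y, train_fraction=0.8):
    """Horizontal split: the first rows form the training set, the rest the test set."""
    cutoff = int(len(X) * train_fraction)
    return X[:cutoff], X[cutoff:], y[:cutoff], y[cutoff:]

rows = import_data(RAW_CSV)
summary = explore(rows, ["area", "rooms", "price"])
# The "id" column is left out of the features because it carries no predictive meaning.
X, y = vertical_split(rows, target="price", features=["area", "rooms"])
X_train, X_test, y_train, y_test = horizontal_split(X, y)
print(summary)
print(len(X_train), "training rows,", len(X_test), "test rows")
```

Splitting by position keeps the sketch simple. Shuffling the rows before a horizontal split is common when the row order carries no meaning, such as a time sequence.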
Model Building Stage: This is the second stage, where you write the code and choose the algorithms that operate on your dataset to develop your model. The three steps within the model-building stage are:
~ Baseline
~ Iterate
~ Evaluate
The Baseline step is where you establish the barest minimum performance you would expect from your model. It is set as a standard of performance that your model must either meet or surpass before it is eventually deployed. The Iterate step is where you build your model and train it, repeatedly adjusting it and testing for optimal performance until you are satisfied with the outcome. The Evaluate step is where you assess the model's performance when making predictions on the test set, in other words, data it did not see during training.
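The following sketch walks through the three model-building steps on a small example. It rests on several assumptions that are not part of the process itself: the toy data is invented, mean absolute error (MAE) is chosen as the metric, the baseline always predicts the mean of the training target, and "iterating" is reduced to trying one single-feature linear model per column and keeping the best one. Treat it as one possible illustration, not as the only way to carry out these steps.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept for a single feature."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical, already-split toy data (made up for illustration).
X_train = {"area": [50, 65, 80, 95, 110, 60], "rooms": [2, 3, 3, 4, 4, 2]}
y_train = [150, 200, 240, 290, 330, 185]
X_test = {"area": [70, 100], "rooms": [3, 4]}
y_test = [215, 300]

# Baseline: always predict the mean of the training target.
baseline_prediction = sum(y_train) / len(y_train)
baseline_mae = mean_absolute_error(y_train, [baseline_prediction] * len(y_train))

# Iterate: try one candidate model per feature and keep the best performer.
best_feature, best_model, best_train_mae = None, None, float("inf")
for feature in X_train:
    slope, intercept = fit_line(X_train[feature], y_train)
    predictions = [slope * x + intercept for x in X_train[feature]]
    train_mae = mean_absolute_error(y_train, predictions)
    if train_mae < best_train_mae:
        best_feature, best_model, best_train_mae = feature, (slope, intercept), train_mae

# Evaluate: score the chosen model on test data it never saw during training.
slope, intercept = best_model
test_predictions = [slope * x + intercept for x in X_test[best_feature]]
test_mae = mean_absolute_error(y_test, test_predictions)
baseline_test_mae = mean_absolute_error(y_test, [baseline_prediction] * len(y_test))

print(f"Baseline MAE (train): {baseline_mae:.2f}")
print(f"Best feature: {best_feature}, MAE (train): {best_train_mae:.2f}")
print(f"Model MAE (test): {test_mae:.2f} vs baseline MAE (test): {baseline_test_mae:.2f}")
```

Choosing between candidates on training error is a simplification that keeps the sketch short. In practice, a separate validation set or cross-validation is the usual way to compare models during iteration.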
Results Communication Stage: This is where you share your model with an audience, who may be colleagues, stakeholders in the project, or simply people who want to assess your work. This communication usually takes the form of a technical report that combines tables, graphs, and other data visuals with a concise written account of the process and end results of the machine learning project. This is where you prove that you are not only good at developing ML models but also great at documenting the process and telling a good story with data.
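As a small illustration of the tabular part of such a report, the sketch below renders model scores as an aligned plain-text table. The function name `format_results_table`, the model names, and the scores are all hypothetical and exist only to show the idea. In a notebook you might display a styled table or a chart instead.

```python
def format_results_table(results):
    """Render (model name, test MAE) pairs as an aligned plain-text table."""
    header = f"{'Model':<20}{'Test MAE':>10}"
    lines = [header, "-" * len(header)]
    for name, mae in results:
        lines.append(f"{name:<20}{mae:>10.2f}")
    return "\n".join(lines)

# Made-up scores, for illustration only.
results = [("Baseline (mean)", 42.5), ("Linear model", 2.5)]
report = format_results_table(results)
print(report)
```

Placing the baseline next to the final model lets the audience see at a glance how much the model improves on the minimum standard.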
If you have followed this article closely from the beginning, you should now be able to describe the model-building process step by step in a few sentences. In summary, the machine learning process has three stages: data preparation, model building, and results communication. Keeping a mental picture of the whole process in mind helps the data scientist or engineer maintain better focus from start to finish when working on machine learning projects.