Setting Up Your Machine Learning Projects for Success

Thursday 5/24/18 03:00pm
Posted By Christine Jorgensen

I’ll never forget the acronym GIGO, which stands for Garbage In, Garbage Out. I got the answer wrong in one of my early college courses, and to this day it remains embedded in my mind! Computer programmers coined the term to mean that the integrity of the output depends on the integrity of the input. Today, massive amounts of input data combined with artificial intelligence (AI) are increasingly used to produce decisions that affect everything from our safety and security to our health and well-being, so getting your data right is more critical than ever.

We thought it would be useful to review the process you and your team could undertake to structure a machine learning project from end to end. In doing so, we’ll point out the areas that are important to achieving an ideal outcome. If you still need a primer on the difference between AI, Machine Learning, and Deep Learning, check out our Developers Guide to AI eBook.

Determining the Question or Desired Outcome

Machine learning projects typically follow the same process, as shown in the data science diagram below. First, a question is posed about a business issue that requires some insight or an outcome to predict. This is typically done by a business unit, which might work with a data scientist to refine the question a few times or to provide guidance for the rest of the process, especially around data collection. At this first stage it is important to review the question from several angles, so that bias based on beliefs or on limited context of the subject matter is not introduced.

[Diagram: data science process]

Identifying and Collecting Data

Next up is data collection. Data comes in various forms including big data, small data, structured or unstructured data, IoT data, text, images, videos, sounds, and more. The data may already be on hand, may need to be acquired from elsewhere, or may need to be collected by your team, which extends the time and scope of the project.

A key part of the data collection and review process is known as exploratory data analysis (EDA). This evaluation is necessary to determine whether there is enough data, or whether different types or sources are required. The data then needs to be reviewed from a high level to identify any obvious correlations or significant patterns. Ensuring the right context is also crucial. Here a specialist may be brought in, such as a doctor with deeper experience in a particular area, or a behavioral scientist to identify techniques for enhanced customer support. Once this is done, the data can be sorted and structured into the input format your model requires, for example the tensor shapes and types expected by a model stored in ONNX, the Open Neural Network Exchange format.
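As a minimal sketch of that first high-level review, the snippet below counts missing values per field and measures the correlation between two fields. The records, the field names (temperature, sales), and the helper functions are hypothetical and exist only for illustration; a real project would more likely rely on a data analysis library.

```python
import math

# Hypothetical sample records; None marks a missing value.
records = [
    {"temperature": 20.0, "sales": 110.0},
    {"temperature": 25.0, "sales": 135.0},
    {"temperature": None, "sales": 120.0},
    {"temperature": 30.0, "sales": 158.0},
    {"temperature": 35.0, "sales": 185.0},
]

def missing_counts(rows):
    """Count missing (None) values per field."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            counts[field] = counts.get(field, 0) + (value is None)
    return counts

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Only rows without missing values can be used for the correlation.
complete = [r for r in records if None not in r.values()]
missing = missing_counts(records)
corr = pearson([r["temperature"] for r in complete],
               [r["sales"] for r in complete])
print(f"rows: {len(records)}, missing: {missing}, correlation: {corr:.3f}")
```

A high count of missing values suggests that more or different data is needed, and a strong correlation is a prompt to ask a subject-matter specialist whether the relationship makes sense in context.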

Wrangling Data for Form and Function

A time-consuming and critical aspect of the project is wrangling the data, which is often the job of the data scientist on the team. This must be done before any data is entered into your model. The presence of ‘bad or garbage data’ in your model isn’t always the result of malicious intent; more often it’s because this important step was missed. Data is prepared in a number of ways, including cleaning (typos, bad strings, missing data, corrupt records), verifying the provenance or source of the data, understanding the frequency of the input, and ensuring consistent labelling and taxonomy. If the data doesn’t seem right, don’t be afraid to ask whether it has been cleaned and wrangled; otherwise much time can be wasted when you develop and deploy your model.
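The cleaning steps listed above can be sketched in a few lines. This is only an illustration under assumed conditions: the records, the field names (id, label, weight_kg), and the clean function are hypothetical, and the rules shown (normalize labels, reject corrupt numbers, missing labels, and duplicate ids) are a small subset of what real data usually needs.

```python
# Hypothetical raw records with typical problems: stray whitespace,
# inconsistent label casing, a corrupt number, a duplicate, a missing label.
raw = [
    {"id": "1", "label": " Cat ", "weight_kg": "4.2"},
    {"id": "2", "label": "dog", "weight_kg": "not available"},
    {"id": "3", "label": "DOG", "weight_kg": "30.5"},
    {"id": "3", "label": "DOG", "weight_kg": "30.5"},
    {"id": "4", "label": "", "weight_kg": "3.9"},
]

def clean(rows):
    """Return (cleaned_rows, rejected_rows) without modifying the input."""
    cleaned, rejected, seen_ids = [], [], set()
    for row in rows:
        label = row["label"].strip().lower()  # one consistent spelling per label
        try:
            weight = float(row["weight_kg"])
        except ValueError:
            rejected.append(row)  # corrupt numeric field
            continue
        if not label or row["id"] in seen_ids:
            rejected.append(row)  # missing label or duplicate record
            continue
        seen_ids.add(row["id"])
        cleaned.append({"id": row["id"], "label": label, "weight_kg": weight})
    return cleaned, rejected

cleaned, rejected = clean(raw)
print(len(cleaned), "kept;", len(rejected), "rejected")
```

Keeping the rejected rows rather than silently dropping them makes it easier to answer the question of whether, and how, the data was cleaned.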

Developing Your Model

With modeling, you decide on the parameters and a framework to analyze the data and train a model on it. Here you may decide to build your own model with languages like R or Python, or perhaps rely on ready-made models available through frameworks such as TensorFlow or Caffe2, or in an interchange format such as ONNX. For applications where access to remote (e.g., cloud) data processing or AI may not always be available or desirable (e.g., in a car), optimizing your model for edge computing, where processing is done on hardware that is closer to your data source, might make more sense.
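To make the idea of deciding on parameters concrete, here is a deliberately tiny sketch that trains a one-feature linear model in plain Python. The function train_linear, its learning_rate and epochs settings, and the toy data are all assumptions made for illustration; in practice a framework such as those named above would handle the training loop.

```python
def train_linear(xs, ys, learning_rate=0.05, epochs=1000):
    """Fit y ~ w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every training example at the current w, b.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Toy training data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = train_linear(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
```

The learning rate and the number of epochs are choices made before training, while w and b are learned from the data. A learning rate that is too large for the data makes the loop diverge, which is one reason these settings deserve deliberate attention.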

Putting It All Together: Deployment, Communication, And Output

Now it’s time to run the data through your deployed model and review the results. Did they meet your team’s expectations? Before you communicate and share your results publicly, check whether the data and your model should be evaluated once more. There are a number of tools to help present and visualize the data. Satisfied with the results? Now you or your management can use the output to make a decision or take some action.
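One simple way to review results is to compare the model's predictions against known answers on data held back from training. The sketch below assumes a binary classification task; the evaluate function and the label lists are hypothetical examples, not output from a real model.

```python
def evaluate(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall for binary predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical held-out labels and the model's predictions for them.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = evaluate(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Which metric matters most depends on the original business question: a missed positive case and a false alarm rarely cost the same, so accuracy alone can hide a result that does not meet expectations.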

We hope this overview gives you helpful guidance as a developer so you can provide leadership to your team in your machine learning projects. If you have any specific development tips, we’d love to hear them and share them with the rest of our Qualcomm Developer Network community. In addition, if you need help with processing your AI solutions on the edge, be sure to check out the Qualcomm® Neural Processing SDK along with the resources we have posted there.