Building the “Hello World” of Kaggle projects using AutoAI

Category: IT Technology · Published: 3 years ago

I am often asked what the best way to get started with data science is. My answer is always the same: take plenty of data science courses to get familiar with the concepts, then move on to building projects by participating in Kaggle competitions. In this blog, I will show how I started my Kaggle journey by completing the Titanic survival Kaggle competition using AutoAI in just 30 minutes.

What are Kaggle competitions?

Photo by Justin Chan on Data-Driven Investor

Kaggle gives data science enthusiasts a platform for analytics competitions: companies and researchers post data, and participants compete to produce the best models for predicting and describing it. With over 13,000 datasets at present, Kaggle offers a veritable gold mine of data to work with.

Why IBM Watson AutoAI?

AutoAI in Watson Studio automates tasks that typically take data scientists days or weeks. All you need to do is submit your data and leave it to the tool to decide the best pipeline for your project.

Photo by Greg Filla on IBM Watson

Let the Kaggling begin!

The Titanic project is known as the ‘hello world’ of Kaggle projects because it lets beginners get hands-on Kaggle experience before attempting more complex projects.

Preparation:

  1. Set up AutoAI: create an IBM Cloud account, create a Watson Studio instance, create a project, and add AutoAI to the project.
  2. Create an account on Kaggle.
  3. Join the Titanic survival competition.

Step 1 — Data Collection:

Download the dataset from the competition page. The zip file contains three files: train.csv is the file we will use to train the model, test.csv is the file we will batch score with our model and use for the submission, and gender_submission.csv shows what a submission file for Kaggle should look like (it's the template we will fill out).

Step 2 — Data preparation:

The data is fairly clean, but both the training and the testing datasets contain null values that need to be handled. First, replace the null values in the Age and Fare columns with the average of each column. I used the Excel ‘AVERAGE’ formula to compute the averages and filled them in. I left the null values in the Cabin column as they are, since they are handled later, when the model is batch scored in Step 4.
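If you would rather not do this by hand in Excel, the same mean-fill can be sketched in a few lines of Python. This is only an illustration, not what I actually ran: the helper names `fill_with_mean` and `clean_file` are made up for this example, and the rounding to four decimals is an arbitrary choice.

```python
import csv
from statistics import mean


def fill_with_mean(rows, column):
    """Replace empty cells in `column` with the mean of the non-empty ones."""
    known = [float(row[column]) for row in rows if row[column] != ""]
    average = mean(known)
    for row in rows:
        if row[column] == "":
            row[column] = str(round(average, 4))
    return average


def clean_file(in_path, out_path, columns=("Age", "Fare")):
    """Read a CSV, mean-fill the given columns, and write the result."""
    with open(in_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        rows = list(reader)
    for column in columns:
        fill_with_mean(rows, column)  # other columns, such as Cabin, stay untouched
    with open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Calling `clean_file("train.csv", "train_clean.csv")` and then the same for test.csv would give you the two cleaned files (the output file names here are just examples).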

Step 3 — Model Building using AutoAI:

Although this step may sound hard, it is the easiest one because we are using AutoAI. Simply create an AutoAI experiment in Watson Studio, give it a name, and upload your train.csv file. Select ‘Survived’ as the variable to predict, run the experiment, and wait at least 15 minutes.

AutoAI takes the dataset and the target variable and designs several pipelines (these are different candidate models), applying hyperparameter optimization (HPO) and enhanced feature engineering to each pipeline to find the best model.

As you might already know, there are different metrics for evaluating and selecting the best model, such as accuracy, F1 score, and precision. You can change the metric to suit your needs, but we will use accuracy for this case (you are free to try the others as well). AutoAI shows how each pipeline performed against the different metrics. This is how the best pipeline (the leader pipeline) is selected and recommended for deployment.
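To make the metrics concrete, here is a small sketch of how accuracy, precision, recall, and F1 are computed for a binary target like ‘Survived’. It uses the standard textbook definitions and is not AutoAI's own code. The function name and the toy labels are made up for illustration.

```python
def classification_scores(actual, predicted):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy example: six passengers, four of them classified correctly.
scores = classification_scores([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
```

Accuracy counts every correct prediction equally, which suits this competition because Kaggle scores Titanic submissions by the share of passengers predicted correctly.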

To deploy the model, click the first pipeline (marked with a star), click ‘Save as’, and choose ‘Model’. This lets you deploy your model as a Watson Machine Learning (WML) model.

Go back to your project, click the WML model, and deploy it as a web service. Once it is deployed, you can test it and grab the scoring link, which we will use in the next step.

Step 4 — Batch scoring the model:

Now that we have our model, let's create a Python script that batch scores the AutoAI model against our test.csv so we can submit the results.

Below is the code to run a batch score test for all the records in the test file.

The code stores each record in variables, one at a time, so it can pass the record as a payload to be scored by the model. Since some of the Cabin values are empty, we replace them with None. Once we get the JSON result back from the model, we parse it to get the predicted value for the record and store it in an array, which is then written to the Results.csv file.
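The flow described above can be sketched roughly as follows. Treat it as a hedged reconstruction rather than the exact script: the payload layout (`input_data` with `fields` and `values`) and the response layout (`predictions` with a `prediction` field) are assumptions about the scoring endpoint, so compare them with the sample request shown on your deployment's test tab. The helper names, the bearer-token header, and the column list are likewise illustrative.

```python
import csv
import json
import urllib.request

# Assumption: the model expects the test.csv columns in this order.
FIELDS = ["PassengerId", "Pclass", "Name", "Sex", "Age", "SibSp",
          "Parch", "Ticket", "Fare", "Cabin", "Embarked"]
NUMERIC = {"PassengerId": int, "Pclass": int, "Age": float,
           "SibSp": int, "Parch": int, "Fare": float}


def build_payload(row):
    """Turn one test.csv row (a dict of strings) into a scoring payload."""
    values = []
    for field in FIELDS:
        raw = row.get(field, "")
        if raw == "":
            values.append(None)  # empty cells, e.g. Cabin, become None
        elif field in NUMERIC:
            values.append(NUMERIC[field](raw))
        else:
            values.append(raw)
    # Assumed payload layout; adjust to what your deployment expects.
    return {"input_data": [{"fields": FIELDS, "values": [values]}]}


def parse_prediction(result):
    """Pull the predicted label out of the (assumed) JSON response."""
    first = result["predictions"][0]
    column = first["fields"].index("prediction")
    return int(first["values"][0][column])


def make_http_scorer(scoring_url, token):
    """Build a function that POSTs a payload to the scoring link."""
    def score(payload):
        request = urllib.request.Request(
            scoring_url,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json",
                     "Authorization": "Bearer " + token},
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    return score


def score_file(test_path, out_path, score):
    """Score every record in test_path and write the submission file."""
    with open(test_path, newline="", encoding="utf-8") as src:
        rows = list(csv.DictReader(src))
    with open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["PassengerId", "Survived"])
        for row in rows:
            prediction = parse_prediction(score(build_payload(row)))
            writer.writerow([row["PassengerId"], prediction])
    return len(rows)
```

With a real deployment, a call such as `score_file("test.csv", "Results.csv", make_http_scorer(scoring_url, token))` would produce the submission file, where `scoring_url` is the scoring link from Step 3 and `token` is an access token for your own IBM Cloud account. Passing the scorer in as a function also makes it easy to try the script with a fake scorer before calling the real service.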

Results.csv file

Final Step — Submitting the results to Kaggle:

Go to the competition page and click on submit results to submit your Results.csv file. Wait until you appear on the leaderboard, where your score (mine is 77%) and your rank will be shown. The score determines your rank. You can keep improving your model and resubmitting to climb toward the top.
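Before uploading, it can help to sanity-check the file against the gender_submission template. The sketch below assumes the submission needs a `PassengerId,Survived` header, one row per passenger in test.csv, and 0/1 predictions. The function name `check_submission` is made up for this example.

```python
import csv


def check_submission(results_path, test_path):
    """Return a list of problems found in the submission (empty if none)."""
    with open(test_path, newline="", encoding="utf-8") as f:
        expected_ids = [row["PassengerId"] for row in csv.DictReader(f)]
    with open(results_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames
        rows = list(reader)

    problems = []
    if header != ["PassengerId", "Survived"]:
        problems.append(f"unexpected header: {header}")
        return problems  # the remaining checks rely on these column names
    found_ids = [row["PassengerId"] for row in rows]
    if sorted(found_ids) != sorted(expected_ids):
        problems.append("PassengerId values do not match test.csv")
    bad = [row["PassengerId"] for row in rows
           if row["Survived"] not in ("0", "1")]
    if bad:
        problems.append(f"non-binary predictions for passengers: {bad}")
    return problems
```

An empty list from `check_submission("Results.csv", "test.csv")` means the file has the expected shape. It says nothing about how good the predictions are.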

