Machine Learning is Stocks. Pre-Processing for Unsupervised Company Classification.

Category: IT Technology · Published: 3 years ago


This article explores the idea of building a new, data-driven classification of companies based on their financials instead of the type of business they do.

The premise of this study is that great stocks to buy would be superior among their peers. In this context, I define a peer group as a group of companies that have a similar financial structure (e.g. small cap, lots of debt, profitable).

If you haven’t read the introduction, I suggest you read my preface to this study here:

Bringing Up To Speed

In the previous article, I selected five dimensions to define the financial structure of a company:

  • Profitability — to capture whether the company makes money or not
  • Total Assets — to capture company size
  • Debt to Equity Ratio — to capture financing structure
  • Operating Margin — to capture the difficulty of making profits
  • Earnings per Share Growth (EPS) — to capture year-to-year growth

I’ve also made sure that these dimensions are independent. This is an important consideration since I want to measure how a company is doing on each front regardless of its performance elsewhere. Here is the Spearman correlation plot I showed earlier:

[Figure: Spearman correlation plot of the five features. Financial Data from Vhinny]
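For readers who want to reproduce this kind of independence check, here is a minimal sketch in NumPy. It is not the code behind the plot above: the function name `spearman_matrix` and the toy numbers are illustrative assumptions, and the sketch assumes no missing values and no ties within a feature.

```python
import numpy as np

def spearman_matrix(features):
    """Spearman rank correlation between the columns of a 2-D array.

    Assumes no missing values and no tied values within a column;
    ties would need average ranks, which this sketch does not compute.
    """
    features = np.asarray(features, dtype=float)
    # argsort of argsort turns each column's values into 0-based ranks.
    ranks = features.argsort(axis=0).argsort(axis=0)
    # Spearman correlation is the Pearson correlation of the ranks.
    return np.corrcoef(ranks, rowvar=False)

# Toy data: three made-up features for five hypothetical companies.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy = np.column_stack([x, x ** 3, [2.0, 1.0, 4.0, 3.0, 5.0]])
corr = spearman_matrix(toy)
print(corr.round(2))
```

Because the correlation is computed on ranks, the first two toy features (x and x cubed) come out perfectly correlated even though their relationship is not linear. Off-diagonal values close to 0 are what would support treating two features as independent dimensions.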

Data Exploration

Before I go into clustering, let’s evaluate the distributions of these five features. Understanding the distributions is crucial for selecting an appropriate clustering algorithm. It will also surface potential flaws in the data, which I will handle during pre-processing.

Below are the distributions of all five (5) features presented on a single plot:

[Figure: Distributions for Select Features]

The X axis shows the value of each feature. The Y axis shows the number of companies falling into each corresponding bin. For this analysis, each company is represented once, with its financials as reported in 2019. The total number of companies used in this study is ~4,000.

A few observations stand out:

  • Negative Debt to Equity
  • A surprisingly high proportion of Debt to Equity ratios below 0.5
  • Skewed distributions of Assets and Debt to Equity

Negative Debt to Equity is uncommon, but possible. While debt cannot be negative, equity can be negative if total liabilities outgrow the company’s total assets.
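A quick worked example makes the sign flip concrete. The numbers and the function name `debt_to_equity` are hypothetical, and equity is assumed to be computed as total assets minus total liabilities:

```python
def debt_to_equity(total_debt, total_assets, total_liabilities):
    """Debt to Equity, with equity assumed to be assets minus liabilities."""
    equity = total_assets - total_liabilities
    return total_debt / equity

# Hypothetical company: liabilities (100) exceed assets (80), so equity is -20.
ratio = debt_to_equity(total_debt=50.0, total_assets=80.0, total_liabilities=100.0)
print(ratio)  # -2.5
```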

Skewed distributions are also okay, but we have to keep them in mind when choosing a clustering algorithm. Some algorithms assume normally distributed data, which would not hold here.

The fact that most Debt to Equity ratios fall below 0.5 is indeed concerning. Many public companies rely heavily on debt financing, which often outweighs stockholders’ equity. Upon further investigation, I found a lot of 0’s in this feature, suggesting potential flaws in the data extraction process. Of all the features I’ve chosen, I would expect every one except profitability to be non-zero.

After replacing all 0’s with nulls, we see a significant change in the Debt to Equity ratios. The other features were not significantly affected:

[Figure: Distributions for Select Features]

This diagram is the same as the one before, except that the dark blue charts were constructed using only non-zero values. With the blue charts placed on top of the red charts, the only tangible difference is in the Debt to Equity feature. Relying on my business sense, I will trust the blue charts going forward.
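The zero-to-null replacement described above can be sketched as follows. This is an illustration only: the feature names, the values, and the encoding of profitability as a 0/1 flag are assumptions, not the actual dataset.

```python
import numpy as np

# Hypothetical feature table: one array per feature, one entry per company.
features = {
    "profitability": np.array([1.0, 0.0, 1.0, 0.0]),  # assumed 0/1 flag
    "total_assets": np.array([5.2e9, 0.0, 3.1e8, 7.4e10]),
    "debt_to_equity": np.array([0.0, 1.8, 0.0, 0.6]),
}

# Features where an exact zero is treated as a missing value.
# Profitability is left out because zero is a legitimate value there.
zero_means_missing = ["total_assets", "debt_to_equity"]

cleaned = dict(features)
for name in zero_means_missing:
    values = features[name]
    cleaned[name] = np.where(values == 0, np.nan, values)

# Summary statistics now have to skip the nulls explicitly.
median_de = np.nanmedian(cleaned["debt_to_equity"])
print(cleaned["debt_to_equity"], median_de)
```

Turning zeros into nulls rather than dropping rows keeps each company in the dataset for the features where its value is trustworthy.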

Choosing the Clustering Algorithm

Clustering is generally half art, half science. Since there is no consistent metric for evaluating the performance of an algorithm, many algorithms are often tried before a working solution is found. While some distance measure may be used for comparison, it is still fairly abstract. A good overview of popular clustering methods may be found here.

The method I’ve chosen for this study is K-means clustering, due to its computational efficiency (each iteration is O(n) in the number of companies for a fixed number of clusters and features) and its relatively trivial configuration. The only decision I have to make is the number of clusters, which can be determined experimentally in quite a streamlined way.
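To show where the linear cost comes from, here is a minimal sketch of the standard K-means loop (Lloyd's algorithm) in NumPy. It is an illustration under stated assumptions, not the implementation used in this study: the function name `kmeans`, its parameters, and the toy points are all hypothetical, and it expects data without nulls.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's-algorithm K-means; returns (centroids, labels)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: distance from every point to every centroid.
        # This costs O(n * k * d), which is linear in the number of points n.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points;
        # an empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy data: two well-separated groups of made-up 2-D points.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
centroids, labels = kmeans(X, k=2)
print(labels)
```

The only knob besides the data is `k`, which matches the point above about configuration. Because each centroid is a plain mean, a few extreme values can drag it a long way.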

Final Steps Before I Cluster

During the Data Exploration phase, we saw skewed distributions for the Assets and Debt to Equity features. Since the K-means algorithm is based on finding optimal means in the data, skewed distributions are dangerous because outliers may pull these means uncontrollably in their direction.
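A tiny numeric illustration of that pull, using made-up values: a single extreme point moves the mean a long way while the median stays put.

```python
import numpy as np

typical = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])

mean_typical = typical.mean()
mean_with_outlier = with_outlier.mean()

print(mean_typical, np.median(typical))            # 3.0 3.0
print(mean_with_outlier, np.median(with_outlier))  # 202.0 3.0
```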

One way to deal with this is to remove outliers. Since 4,000 examples is not an excessive amount of data, removing data points is not desirable. Instead, I can enforce max and min values in the distribution to reduce the distance between the outliers and the mean.

One way to do this is to use quartiles and the interquartile range to cap the min and max values in the distribution. In this case, I’ve used

  • Q1 − 3*IQR to cap min
  • Q3 + 3*IQR to cap max

where Q1 and Q3 are the first (25%) and the third (75%) quartiles and IQR = Q3 − Q1 is the interquartile range. I’ve arbitrarily chosen a multiple of 3 for the IQR to keep the caps relatively far away from the median, so as not to significantly disturb the original distribution.

Below is sample code for this operation to better illustrate the approach.
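A minimal sketch of the capping step, assuming each feature is held in a NumPy array in which nulls are stored as NaN. The function name `cap_tails` and the `multiple` parameter are illustrative names chosen for this sketch.

```python
import numpy as np

def cap_tails(values, multiple=3.0):
    """Clip values to [Q1 - multiple*IQR, Q3 + multiple*IQR], ignoring NaNs."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.nanpercentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - multiple * iqr
    upper = q3 + multiple * iqr
    # np.clip leaves NaN entries as NaN and pulls outliers back to the caps.
    return np.clip(values, lower, upper)

# Made-up feature with one extreme value in the right tail.
raw = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 1000.0])
capped = cap_tails(raw)
# Here Q1 = 3, Q3 = 7 and IQR = 4, so the caps are -9 and 19.
print(capped)
```

No row is dropped: the extreme value is moved to the upper cap of 19 and every other value is left untouched, which is the point of capping instead of removing outliers.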

This capping procedure was applied to all the features in the dataset. The resulting distributions for Assets and Debt to Equity ratio are shown below:

[Figure: The Effect of Capping Tails of the Original Distribution]

Next Steps

This concludes the data exploration and pre-processing phases of this analysis. In the next article, I will determine the optimal number of clusters based on the data and identify characteristic profiles for the companies.

By the way …

Let’s Connect!

I’m happy to connect with people who share my path towards financial independence. If you also search for financial independence or if you’d like to collaborate, bounce ideas or exchange thoughts, please reach out! Here are some places to find me:

Cheers!

