AI Feynman 2.0: Learning Regression Equations From Data


Let’s kick the tires on a brand new library

Jul 1 · 9 min read

Image by Gerd Altmann from Pixabay (CC0)


1. A New Symbolic Regression Library

I recently saw a post on LinkedIn from MIT professor Max Tegmark about a new ML library his lab released, and I decided to try it out. The paper is AI Feynman 2.0: Pareto-optimal symbolic regression exploiting graph modularity, submitted June 18th, 2020. The first author is Silviu-Marian Udrescu, who was generous enough to hop on a call with me and explain the backstory of this new machine learning library. The library, called AI Feynman 2.0, helps to fit regression formulas to data. More specifically, it helps to fit formulas to data at different levels of complexity (defined in bits). The user selects the operators that the solver is allowed to use from predefined sets of operators, and the solver does its thing. Operators are things like exponentiation, cos, arctan, and so on.

Symbolic regression is a way of stringing together the user-specified mathematical functions to build an equation for the output “y” that best fits the provided dataset. That provided dataset takes the form of sample points (or observations) for each input variable x0, x1, and so forth, along with the corresponding “y”. Since we don’t want to overfit on the data, we need to limit the allowed complexity of the equation or at least have the ability to solve under a complexity constraint. Unlike a neural network, learning one formula with just a few short expressions in it gives you a highly interpretable model, and can lead to insights that you might not get from a neural network model with millions of weights and biases.
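To make the idea concrete, here is a toy sketch of symbolic regression as a brute-force search over a handful of hand-written candidate formulas. This is not how AI Feynman searches; the candidate list, the hidden rule x0 * x1, and the sample ranges are all assumptions made up for illustration.

```python
import math
import random

# Toy illustration only: AI Feynman does not search this way.
random.seed(0)

# Sample points for two inputs, with y produced by a hidden rule (x0 * x1).
samples = [(random.uniform(1.0, 5.0), random.uniform(1.0, 5.0)) for _ in range(200)]
targets = [x0 * x1 for x0, x1 in samples]

# Hypothetical candidate formulas built from a small operator set.
candidates = {
    "x0 + x1": lambda x0, x1: x0 + x1,
    "x0 - x1": lambda x0, x1: x0 - x1,
    "x0 * x1": lambda x0, x1: x0 * x1,
    "x0 / x1": lambda x0, x1: x0 / x1,
    "cos(x0) + x1": lambda x0, x1: math.cos(x0) + x1,
    "exp(x0 - x1)": lambda x0, x1: math.exp(x0 - x1),
}


def rmse(formula):
    """Root mean squared error of one candidate over the whole dataset."""
    squared = [(formula(x0, x1) - y) ** 2 for (x0, x1), y in zip(samples, targets)]
    return math.sqrt(sum(squared) / len(squared))


# Keep the candidate with the lowest error on the data.
best = min(candidates, key=lambda name: rmse(candidates[name]))
print(best, rmse(candidates[best]))
```

A real solver has to build the candidate expressions itself and trade error off against complexity, which is where the hard work is.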

Why is this interesting? Well, science tends to generate lots of observations (data) that scientists want to generalize into underlying rules. These rules are equations that “fit” the observations. Unlike a “usual” machine learning model, equations of the form y=f(x) are very clear, and they can omit some of the variables in the data that are not needed. In the practicing machine learning engineer’s toolbox, regression trees would be the closest concept I can think of that implements this idea of learning an interpretable model that connects observations to a prediction. Having a new way to try and fit a regression model to data is a good addition to the toolbox of stuff you can try on your dataset.

In this article, I want to explore this new library as a user (how to use it), rather than as a scientist (how does it work). AI-Feynman 2.0 reminds me of UMAP, in that it includes very fancy math on the inside of the solver, but does something useful to me in an abstracted way that I can treat as a black box. I understand that the code is going to be updated in stages over the next several months, and so the way the interface to the code looks today may not be the way it works when you are reading this. Hopefully, more documentation will also be added to give you a quick path to trying this on your data. For the moment, I’m including a notebook with this article so that you can dive in and get everything working from one place.

The library uses machine learning to help with the equation discovery, breaking the problem into subproblems recursively, but let’s not get too far into the weeds. Let’s instead turn our attention to using the library. You are welcome to read the paper to learn more about how the library does what it does to solve the symbolic regression mystery on your data.

2. Code

A Google Colab notebook containing all of the code for this article is available here:

dcshapiro/AI-Feynman


Some notes on the output are important. The solver repeatedly prints the Complexity, RMSE, and Expression. Be aware that the RMSE number is not actually the Root Mean Squared Error. It is the Mean Error Description Length (MEDL) described in the paper, and that message will be changed soon. Also, the Expression printout is not the equation for the dataset, but rather for the sub-problem within the overall problem graph that the solver is currently working on. This is important because you will sometimes see a printout with a very low error that only applies to some subproblem and is not the equation you are trying to find. The final results are stored in the results folder using the name of the input file.
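To show why a description-length number behaves differently from RMSE, here is a minimal sketch that computes both on the same residuals. The formula used here, half of log2(1 + (error/ε)²) averaged over the samples, and the precision floor ε = 2⁻³⁰ are my reading of the paper and should be treated as assumptions; this is not code taken from the library.

```python
import math

# Assumed precision floor; an assumption, not a value read from the library.
EPSILON = 2.0 ** -30


def rmse(errors):
    """Classic root mean squared error of a list of residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def mean_error_description_length(errors, eps=EPSILON):
    """Average number of bits needed to describe each residual at precision eps."""
    bits = [0.5 * math.log2(1.0 + (e / eps) ** 2) for e in errors]
    return sum(bits) / len(bits)


# Two made-up sets of residuals: a near-perfect fit and a rough fit.
errors_small = [1e-9, -2e-9, 5e-10]
errors_large = [0.1, -0.2, 0.05]

for label, errors in (("small", errors_small), ("large", errors_large)):
    print(label, rmse(errors), mean_error_description_length(errors))
```

Under these assumptions the printed value is measured in bits, so a number in the twenties does not mean the fit is off by twenty units of y.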

3. Try the First Example from the AI-Feynman Repository

Clone the repository and install the dependencies. Next, compile the Fortran code and run the first example dataset from the AI-Feynman repository (example1.txt from the repository).

The first few steps are listed here:

Next, put this file into the Code directory and run it with python3:

The first line of the example1.txt file is:

1.6821347439986711 1.1786188905177983 4.749225735259924 1.3238356535004034 3.462199507094163

Example 1 contains data generated from an equation, where the last column is the regression target, and the rest of the columns are the input data. The following example shows the relationship between the first line of the file example1.txt and the formula used to make the data.

The target “y” data points in example1.txt are generated using the equation y = ((x0-x1)**2 + (x2-x3)**2)**0.5, where the inputs are all the columns except for the last one, and the equation generates the last column.
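A minimal sketch of this check, using only the first line of example1.txt quoted above and the equation the solver reports for this example. The variable names are my own.

```python
import math

# First line of example1.txt, as quoted in this article.
first_line = (
    "1.6821347439986711 1.1786188905177983 4.749225735259924 "
    "1.3238356535004034 3.462199507094163"
)

# All columns except the last are inputs; the last column is the target y.
*inputs, y = [float(value) for value in first_line.split()]
x0, x1, x2, x3 = inputs

# Candidate generating equation: the Euclidean distance between two 2D points.
y_formula = ((x0 - x1) ** 2 + (x2 - x3) ** 2) ** 0.5

print("y from file:   ", y)
print("y from formula:", y_formula)
print("match:", math.isclose(y, y_formula, rel_tol=1e-9))
```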

Let’s now run the program. In the folder AI-Feynman/Code/ run the command python3 ai_feynman_magic.py to run the program we wrote above which in turn fits equations to the example1.txt dataset.

The solver runs for a long time, trying different kinds of equations at different levels of complexity, and assessing the best fit for each one. As it works through the solution, it prints intermediate results. If it hits a super low error you can stop the program and just use the equation. It’s really your call whether you let it run to the end. For the input file example1.txt, the results show up in AI-Feynman/Code/results/solution_example1.txt. There are other spots where results are generated, but this is the place we care about right now. That file “solution_…txt” ranks the identified solutions. It’s funny that assuming y is a constant is a common strategy for the solver. Constants have no input variables, and so they have low complexity in terms of the number of bits. In the case of example 1, the equation ((x0-x1)**2 + (x2-x3)**2)**0.5 fit the best.
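The trade-off between complexity and error can be sketched as a Pareto front: keep a formula only if no simpler formula fits at least as well. The triples below are made-up illustrative numbers, not output from the solver, and I am not reproducing the layout of the solution file here.

```python
# Hypothetical (complexity_bits, error, expression) triples; the numbers are made up.
candidates = [
    (1.0, 28.0, "3"),
    (6.0, 27.5, "x0 + 3"),
    (12.0, 26.0, "x0*x2"),
    (14.0, 27.0, "cos(x1)"),  # more complex AND worse than "x0*x2"
    (30.0, 1.0, "((x0-x1)**2 + (x2-x3)**2)**0.5"),
]


def pareto_front(rows):
    """Keep rows that are not beaten on error by any row of equal or lower complexity."""
    front = []
    best_error = float("inf")
    # Walk from simplest to most complex; a row survives only if it lowers the error.
    for row in sorted(rows, key=lambda r: (r[0], r[1])):
        if row[1] < best_error:
            front.append(row)
            best_error = row[1]
    return front


for complexity, error, expression in pareto_front(candidates):
    print(complexity, error, expression)
```

In this sketch the constant survives because nothing is simpler than it, which matches the observation that constants keep showing up in the ranking.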

4. Try Our Own Simple Example

In the Colab notebook, I have now moved the repository and data to Google Drive so that they persist. The following code generates 10,000 examples from an equation. This example has 2 “x” variables and 2 duplicated “x” variables. Of course, y is still the output.
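Here is a minimal sketch of such a generator. It assumes the plane y = -0.5*x0 + 0.5*x2 + 3 that this article reports as the generating equation, with x1 a copy of x0 and x3 a copy of x2. The input range, the seed, the output file name, and writing to the temp directory are my own assumptions; the space-separated layout with the target in the last column follows example1.txt.

```python
import os
import random
import tempfile


def make_row(rng):
    """One observation: two real inputs, two exact duplicates, and the target y."""
    x0 = rng.random()  # assumed input range [0, 1)
    x2 = rng.random()
    x1 = x0            # duplicate of x0
    x3 = x2            # duplicate of x2
    y = -0.5 * x0 + 0.5 * x2 + 3.0
    return (x0, x1, x2, x3, y)


def make_dataset(n_rows, seed=0):
    """Generate n_rows observations with a fixed seed so runs are repeatable."""
    rng = random.Random(seed)
    return [make_row(rng) for _ in range(n_rows)]


rows = make_dataset(10_000)

# Hypothetical file name; one space-separated row per line, target last.
path = os.path.join(tempfile.gettempdir(), "duplicate_vars_example.txt")
with open(path, "w") as handle:
    for row in rows:
        handle.write(" ".join(repr(value) for value in row) + "\n")

print(len(rows), "rows written to", path)
```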

Plotting the first variable against Y we get:

Now that we took a peek at what our data looks like, let’s ask the solver to find a simple equation that fits our data, using our dataset. The idea is that we want the solver to notice that you don’t need all of the supplied variables in order to fit the data.

Here is an example of a permissions problem:

If you have file permission issues when you try to run the code, open up the file permissions like this:

chmod +777 AI-Feynman/Code/*

Below is the command to run the solver. Go get coffee, because this is not going to be fast…

python3 ai_feynman_duplicate_variables.py

If you have nothing better to do, watch the solver go. Notice that the solver goes through a list of equation types before mixing it up. The initial models it tries out are quickly mapped to x0 and x2, as it “realizes” that x1 and x3 are duplicates and so not needed. Later on, the solver finds the equation 3.000000000000+log(sqrt(exp((x2-x1)))), which is a bit crazy but describes a plane.

We can see on WolframAlpha that an equivalent form of this equation is:

y=(x2 - x1)/2 + 3.000000000000

which is what we used to generate the dataset!
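If you would rather not take the simplification on faith, a quick numeric check does the job: log(sqrt(exp(d))) is log(exp(d/2)), which is d/2. The sketch below compares the two forms at random points; the sampling range and function names are my own choices.

```python
import math
import random


def solver_form(x1, x2):
    """The expression as printed by the solver."""
    return 3.0 + math.log(math.sqrt(math.exp(x2 - x1)))


def plane_form(x1, x2):
    """The simplified, equivalent plane."""
    return (x2 - x1) / 2 + 3.0


rng = random.Random(0)
points = [(rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0)) for _ in range(1000)]

# Largest disagreement between the two forms over all sampled points.
max_gap = max(abs(solver_form(a, b) - plane_form(a, b)) for a, b in points)
print("largest difference:", max_gap)
```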

The solver settled on y=log(sqrt(exp(-x1 + x3))) + 3.0, which we know is a correct description of our plane from the WolframAlpha simplification above. The solver ended up using x1 and x3, dropping x0 because it is a copy of x1 and so not needed, and similarly dropping x2 because it is not needed when using x3.

Now, that worked, but it was a bit of a softball problem. The data has an exact solution, and so it didn’t need to fit noisy data, which is not a realistic real-world situation. Real data is messy. Let’s now add noise to the dataset and see how the library holds up. We don’t need to go as far as introducing missing variables and imputation. Let’s just make the problem a tiny bit harder to mess with the solver.

5. Symbolic Regression on Noisy Data

The following code creates points on the same plane as the previous example, but this time noise is added.

Note: In the notebook code I increased the dataset size to 100K samples (from the current 10K samples) to make the dataset size similar to example1. You don’t need to do that, and so I left this GIST as 10K samples.
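A minimal sketch of a noisy generator for the same plane follows. The noise model is an assumption on my part: independent Gaussian noise with a made-up standard deviation is added to each of the four input columns, and y is computed from the clean values. The notebook may do this differently.

```python
import random

NOISE_STD = 0.05  # hypothetical noise level, not taken from the notebook


def make_noisy_row(rng, noise_std=NOISE_STD):
    """One observation where each duplicated column gets its own noise."""
    clean_a = rng.random()
    clean_b = rng.random()
    y = -0.5 * clean_a + 0.5 * clean_b + 3.0  # target from the clean values (assumption)
    x0 = clean_a + rng.gauss(0.0, noise_std)
    x1 = clean_a + rng.gauss(0.0, noise_std)  # noisy copy of the same signal as x0
    x2 = clean_b + rng.gauss(0.0, noise_std)
    x3 = clean_b + rng.gauss(0.0, noise_std)  # noisy copy of the same signal as x2
    return (x0, x1, x2, x3, y)


rng = random.Random(0)
noisy_rows = [make_noisy_row(rng) for _ in range(10_000)]

# The "duplicate" columns now differ slightly in every row.
print(noisy_rows[0])
```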

The following figure shows how the duplicate columns are now no longer exact duplicates. Will the solver average the points with noise on them to get a better signal? I would average x0 and x1 into a cleaner point, and then average x2 and x3 into a cleaner point. Let’s see what the solver decides to do.
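The intuition behind averaging can be checked with a short sketch: if two columns are the same signal plus independent noise, their average carries noise with a standard deviation smaller by a factor of about the square root of 2. The Gaussian noise and its scale are assumptions for illustration.

```python
import random
import statistics

NOISE_STD = 0.05  # hypothetical noise level
rng = random.Random(0)

# One clean signal observed twice with independent noise.
true_values = [rng.random() for _ in range(10_000)]
copy_a = [v + rng.gauss(0.0, NOISE_STD) for v in true_values]
copy_b = [v + rng.gauss(0.0, NOISE_STD) for v in true_values]
averaged = [(a + b) / 2 for a, b in zip(copy_a, copy_b)]

# Spread of the error for a single noisy copy versus the averaged copy.
single_error = statistics.pstdev([a - v for a, v in zip(copy_a, true_values)])
averaged_error = statistics.pstdev([m - v for m, v in zip(averaged, true_values)])

print("single copy error std:", single_error)
print("averaged error std:   ", averaged_error)
print("ratio:", single_error / averaged_error)
```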
