Shortest Path Distance with Deep Learning



1. What does the paper say?

1.1 Motivation: Why do we need deep learning to approximate distances between nodes when we have traditional exact methods like Dijkstra's and A* algorithms? The problem with these traditional algorithms is that they are slow on very large graphs and would consume a lot of memory to store precomputed distances. And since for most applications an approximation of the actual distance is good enough, it encourages one to explore various ways to approximate distances. Also, once a neural network is trained, the inference time (finding the distance between two nodes) is constant (O(1)).

1.2 Algorithm: Now that we know the 'why', let's look into the 'how'. The paper uses the ideas presented in another excellent paper, "node2vec: Scalable Feature Learning for Networks". In fact, I would say that some of the ideas used in the paper under discussion were already presented in the Node2Vec paper (for instance, the use of binary operators to represent an edge using the corresponding node embeddings was proposed in the Node2Vec paper and is extended here to represent a path; we will discuss embeddings later). This paper is more of an application of Node2Vec. Node2Vec itself is an extension of Word2Vec. Word2Vec is an algorithm that represents words with embeddings (vectors of numbers) in a vector space, such that semantically similar words are located close to each other. It's a fascinating topic in itself.

Here is a summary of the method proposed:

  1. Collect your graph data.
  2. Use the Node2Vec algorithm to find node embeddings for each node. We don't need to write this algorithm from scratch; the authors have provided an implementation.
  3. Use a certain number of nodes in the graph as, what they call, "landmarks" and compute their distances from all the rest of the nodes. Now you have samples of the form ((landmark_i, node_x), distance).
  4. For each sample found above, fetch the corresponding node embeddings of the landmark and the node, and combine them with any suitable binary operator (average, element-wise multiplication, etc.). Now you have samples of the form (embedding, distance).
  5. Now you have input-output pairs, so you do what you do best: find a good neural net configuration and train the hell out of the model. But as we will see later, as with AI in general, it's not that easy.

2. Implementation

The project has following folders:

data — which holds all the data consumed by the program, both downloaded and processed.

outputs — this holds all the outputs, including text logs, TensorBoard logs, model backups, etc.

src — source code is placed here.

tests — any relevant test cases

Note: For this project I have mostly worked in notebooks because it involved a lot of exploration and experimentation with various approaches. As such, the work is not polished, since I am still exploring better ways. I have tried to explain the cells as much as I could. Feel free to reach out to me if something is unclear.

2.1 Data

I have used Facebook datasets from here. The downloaded graph data is in "mtx" format. This is just another format for sharing matrix data. It looks like this:

%%MatrixMarket matrix coordinate pattern symmetric
6386 6386 217662
195 1
414 1
458 1
474 1
510 1
.
.
.

The first line is called the "header" and defines some properties that determine how the file should be parsed to form the matrix. The line below it (the "size line") defines the size of the data. The header always begins with "%%MatrixMarket", and the four fields that follow it are object, format, field, and symmetry. The object field can be 'matrix' or 'vector' (in our case it is 'matrix'), and the format can be 'coordinate' or 'array'. The "coordinate" format stores only non-zero values in the mtx file, so the size line holds the number of rows, the number of columns, and the number of non-zero entries. If the format is "array", the size line is of the form [number of rows, number of columns]. The next section is the "data lines", which hold the actual data. If the format is "coordinate" there will be as many data lines as non-zero entries, whereas for "array" you should expect {number of rows * number of columns} data lines. In our graph, each data line represents an edge between two nodes, and it can have a third column for the edge weight in weighted graphs. The rest of the details can be found in the "mtx" link above. I have shared this information to get you acquainted with the format, so that if you face parsing errors you can write your own implementation. Most of the time we won't need to, as SciPy supports reading/writing the mtx format. But here we first need to convert mtx to edgelist format, since both Networkx (the package I use to process graph data) and the Node2Vec script use that format.
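As a rough sketch of that conversion (file names as used later in this post; `from_scipy_sparse_array` is the newer NetworkX name, older versions use `from_scipy_sparse_matrix`):

```python
import networkx as nx
from scipy.io import mmread

# Read the MatrixMarket file into a sparse adjacency matrix ...
adj = mmread("../data/socfb-American75.mtx")

# ... build a graph from it and dump it as a plain edgelist.
G = nx.from_scipy_sparse_array(adj)   # use from_scipy_sparse_matrix on older NetworkX
nx.write_edgelist(G, "../data/socfb-American75.edgelist", data=False)
```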

Edgelist is just the data lines after stripping away the header, comments and size lines, like this:
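195 1
414 1
458 1
.
.
.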

2.2 Construct node embeddings

To calculate the node embeddings for our graph we will use the script provided by the Node2Vec authors. But let me first introduce the algorithm briefly, since it's quite interesting and also such an integral part of what we are doing here. Even if we don't have to implement the algorithm, that shouldn't stop us from learning about it. But you can choose to skip to the next section.

2.2.1 Node2Vec

The idea is to sample the neighborhood of each node in the graph, also called a "walk" (which is a fancy way of saying: collect the nearby nodes), convert the visited nodes' IDs to strings (so that each sample is a list of nearby node IDs, including the source node, as strings), and then pass all such lists to a Word2Vec model to train and learn embeddings, just like we train on lists of sentences. But the best part of Node2Vec is how it samples the neighborhood. The authors argue that the classical approaches, BFS and DFS, lie on two opposite ends of the spectrum. BFS tends to sample nodes closer to the source node, which causes the learned embeddings to better capture the structural similarity of nodes (e.g. whether a node acts as a hub/center in its neighborhood), but it only samples a small portion of the graph. DFS, on the other hand, tends to sample nodes far away from the source, so learning from such walks leads to embeddings that are better at capturing a macroscopic view of the graph (connectivity and "homophily", as they mention in the paper) but fail to capture finer details, since walks are of finite length and have a lot of ground to cover.

Hence, they propose a new way of sampling a node's neighborhood called a "2nd order Random Walk". Just to make it clearer: we are doing all this to sample a node's neighborhood in a controlled way, so that the walks include qualities of both BFS and DFS. Consider a walk 'c' of fixed length 'l', and assume you have started your walk from node 'u' and have just traveled from node t to v. Instead of choosing the next node randomly or based on the edge weights alone, they use the following probability distribution:

$$P(c_i = x \mid c_{i-1} = v) = \begin{cases} \pi_{vx}/Z & \text{if } (v, x) \in E \\ 0 & \text{otherwise} \end{cases}$$

(transition probability for the next node, from the node2vec paper)

which means that if the walk is currently at node v (having just come from t), then the probability of the next node being x is pi_{vx}/Z (can't use inline LaTeX because Medium screws it up every time I edit something) if nodes v and x are connected by an edge, and 0 otherwise. pi_{vx} is the unnormalized transition probability and Z is the normalizing constant (the sum of the unnormalized probabilities over all edges out of node v). They define pi_{vx} as:

$$\pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx}$$

where w_{vx} is the weight of the edge between nodes v and x, and $\alpha_{pq}(t,x)$ is:

$$\alpha_{pq}(t, x) = \begin{cases} \frac{1}{p} & \text{if } d_{tx} = 0 \\ 1 & \text{if } d_{tx} = 1 \\ \frac{1}{q} & \text{if } d_{tx} = 2 \end{cases}$$

where d_{tx} is the shortest path distance between t and x. Let's understand the role of p and q, because these are the two parameters that control the nature of the random walk (BFS-like or DFS-like). The walk is called a "2nd order Random Walk" because the next step depends not only on the current node v but also on the previous node t.

If we recap a bit: we reached node v from node t, and now we need to decide which node to visit next from v. If you set p to a low value, the walk will prefer to revisit previous nodes. How, you ask? Imagine that you have arrived at v from t; now you have the following options: go to other neighbors of v (x1, x2, ..., xn) or go back to t (don't forget that t is also a neighbor of v). In the latter case d_{tx} is zero, since x is t itself, which corresponds to the first option, 1/p, in the equation above. So if p is very low, 1/p is quite large, and the random walk prefers to go back to previous nodes, simulating the behavior of BFS. Please refer to the (ugly) image below to visualize this:

random walk: next node decision at ‘v’

The paper has a similar image, but without the distance values. Similarly, if we choose a low value for q, we encourage the algorithm to venture farther away from t by assigning a higher probability to the third option in the equation defining alpha. This decision is made at every step of the walk, and several such walks are performed per node.
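To make the biasing concrete, here is a minimal sketch (not the authors' implementation) of how the unnormalized transition probabilities for the next step could be computed with NetworkX:

```python
import networkx as nx

def transition_probs(G, t, v, p=1.0, q=1.0):
    """2nd-order transition probabilities for the step after arriving
    at v from t (a sketch of the idea, not the node2vec authors' code)."""
    unnormalized = {}
    for x in G.neighbors(v):
        w = G[v][x].get("weight", 1.0)
        if x == t:                 # d_tx == 0: going back to the previous node
            alpha = 1.0 / p
        elif G.has_edge(t, x):     # d_tx == 1: x is also a neighbor of t
            alpha = 1.0
        else:                      # d_tx == 2: moving farther away from t
            alpha = 1.0 / q
        unnormalized[x] = alpha * w
    z = sum(unnormalized.values())
    return {x: pi / z for x, pi in unnormalized.items()}
```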

If you have understood the above explanation, you have an idea of how the node2vec random walk works. You can also check its implementation in the authors' GitHub repository (but it's in Python 2). In my project I have included the authors' script converted to Python 3.

2.2.2 Running the node2vec script

Running the script is quite simple. You can run it with the default values for its parameters or change them to suit your needs. I ran with the default values:

python node2vec/main3.py --input ../data/socfb-American75.edgelist --output ../data/emb/socfb-American75.emd

Just modify the input and output paths.

2.3 Construct Dataset

From the previous step we have a file with the nodes and their embeddings in the following format:

6386 128
224 0.3206627 -0.0963422 -0.039677385 -0.03290484 0.4414181 ....
4046 -0.036134206 -0.082308784 0.49080133 -0.36878866 0.13950641 ...
.
.
.

The first two numbers represent the number of nodes and the dimension of the embeddings, respectively. Our next step is to select a certain number of nodes in the graph as landmarks and compute their distances from all the rest of the nodes. This gives us (number of landmarks * (number of nodes - 1)) samples. We use landmarks because computing distances between all pairs of nodes would require far more compute.

find distance of each node from selected landmark nodes

The "distance_map" dict holds the distance of every node from a given landmark (the landmark is the key). Now we need to get the corresponding node2vec embedding for each node/landmark pair and combine them to form a single embedding.
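A minimal sketch of this step, assuming an unweighted graph and a hypothetical `num_landmarks` variable (the notebook may pick landmarks differently):

```python
import random
import networkx as nx

G = nx.read_edgelist("../data/socfb-American75.edgelist")
num_landmarks = 100                                   # illustrative value
landmarks = random.sample(list(G.nodes()), num_landmarks)

# distance_map[landmark] -> {node: shortest-path distance} (BFS on the unweighted graph)
distance_map = {lm: nx.single_source_shortest_path_length(G, lm) for lm in landmarks}
```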

combine node and landmark embeddings

Here emd_map is a dict that holds each node as a key and its embedding as the value.
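A sketch of the combination step with the average operator (the notebook stores the result in an embedding-distance dict; a flat list of (embedding, distance) pairs works the same way for illustration):

```python
import numpy as np

samples = []
for lm, dists in distance_map.items():
    for node, dist in dists.items():
        if node == lm:
            continue
        emb = (emd_map[lm] + emd_map[node]) / 2.0   # element-wise average of the two embeddings
        samples.append((emb, dist))
```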

form ndarray from embedding-distance dict

Next, we need to form numpy ndarrays from the embedding-distance dict, as shown above. Please note two things from the output of the cell above. First, the number of samples is not equal to num_of_landmarks*(num_of_nodes-1); this is because I have ignored node pairs with inf distance, i.e. no path connecting them. Second, the training array is massive at ~927 MB, since it is a 'float64' array. I have converted it to float32 to save space.

convert to float32

As you can see, we save roughly 50% of the space. If you are worried about the precision loss due to this conversion, you can measure it:

measuring precision loss
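As a rough sketch, assuming `samples` holds the (embedding, distance) pairs from earlier, forming the arrays, downcasting, and checking the loss could look like this:

```python
import numpy as np

X = np.stack([emb for emb, _ in samples])                    # float64 feature matrix
y = np.array([dist for _, dist in samples], dtype=np.float32)

X32 = X.astype(np.float32)
print(f"{X.nbytes / 1e6:.0f} MB -> {X32.nbytes / 1e6:.0f} MB")

# Worst-case absolute difference introduced by the downcast
print("max abs error:", np.abs(X - X32.astype(np.float64)).max())
```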

This seems insignificant. Next, let's look at the distribution of the target variable.

distribution of distance values

As you can see, distance values 2 and 3 dominate the data. Distances 6, 7, and 8 can't be seen on the plot, but they are present in the data, with only one sample for distance 8, which I dropped. Also note that in the paper they ignore samples with distance value 1, but I have included them in training.

Since the data is massively imbalanced, I have stratified the train/test split:
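Something along these lines with scikit-learn (the split ratio is illustrative):

```python
from sklearn.model_selection import train_test_split

# Stratify on the integer distance values so that the splits keep the same proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X32, y, test_size=0.2, stratify=y, random_state=42)
```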

… and then normalized it:
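For normalization, a StandardScaler fit on the training split only (the scaler choice is mentioned in the improvements list below):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)      # reuse the training statistics for the test split
```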

I have saved the split data so that this preprocessing needs to be done only once. Whew!! Dataset formation was a big portion of this project, and I have finally explained the major steps. I may have missed some; for more details you can check data_prep.ipynb in the project repository.

2.4 Training the Neural Network

Now comes the exciting part. The code for training is in the train.ipynb file. You will notice that some of the cells look unfinished or like notes-to-self. I have left these deliberately to let interested readers learn from my failed attempts and the steps taken after each failure. I have even uploaded the TensorBoard logs (which include changes made, results, notes, etc.) to the GitHub repo so that all the history is available for me and others to follow later. If you are not interested in the logs, you can delete those folders without any issues.

Since the data is skewed, I had to oversample the minority target values (I am not using the term “class” here because I have trained a regression model).

I first undersampled the majority values and then oversampled the minority values. The intuition was to oversample the minority distance values to make them comparable in number to the majority values, while keeping the overall data size as small as possible. The fraction value "0.7" is not calculated or derived from anywhere; it just seemed reasonable given the frequencies of each distance value, as sketched below. And yes... don't forget to shuffle the data after over/under sampling. It may seem like a trivial thing, something you can ignore, but as it turned out many of the batches had the same distance value (all 1s or 2s, etc.), which made training impossible!
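Here is a rough sketch of that resampling idea, not the notebook's exact code (the 0.7 fraction and the target-size heuristic are the assumptions described above):

```python
import numpy as np
import pandas as pd

# Undersample the dominant distances (2 and 3), oversample the rare ones, then shuffle.
df = pd.DataFrame(X_train)
df["dist"] = y_train

majority = df[df["dist"].isin([2, 3])].sample(frac=0.7, random_state=42)
minority = df[~df["dist"].isin([2, 3])]
target_size = int(majority["dist"].value_counts().mean())
minority = (minority.groupby("dist", group_keys=False)
                    .apply(lambda g: g.sample(n=target_size, replace=True,
                                              random_state=42)))

balanced = pd.concat([majority, minority]).sample(frac=1.0, random_state=42)  # shuffle
X_train_bal = balanced.drop(columns="dist").to_numpy(dtype=np.float32)
y_train_bal = balanced["dist"].to_numpy(dtype=np.float32)
```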

Now, let's define a baseline, a simple model whose result we can compare against to see how well we did:

setting a baseline with Linear Regression
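A sketch of such a baseline; rounding the regression output to the nearest integer distance to compute an "accuracy" is my reading of the comparison, not necessarily the notebook's exact code:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train_bal, y_train_bal)

# Round the continuous predictions to integer distances and compare.
preds = np.rint(lr.predict(X_test))
print(f"baseline accuracy: {(preds == y_test).mean():.2%}")
```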

Linear Regression does surprisingly well! 50% is a good baseline to compare with, considering that chance prediction in this case is ~14% (1/7).

PyTorch Model

For all of my previous projects I had used Keras, but recently I switched to PyTorch and I haven't regretted it since. I was working on another project which involved writing a custom loss function that required some calculations on the output of an intermediate layer of the model, plus other custom stuff. If you have ever tried such things with Keras, you know it's not straightforward. Even with eager execution I couldn't get it to work. To make matters worse, one fine day Keras started throwing CUDA compatibility errors, and they wouldn't go away no matter how many times I rebuilt the environment with the required drivers. That was my breaking point. With PyTorch you may have to write a little bit of extra code for the training loop, but it's worth it. Now, I am not saying these issues couldn't have been solved with more effort, but in a world where you have PyTorch, why break your head? So that concludes my rant about why I switched to PyTorch.

But readers who are not acquainted with PyTorch shouldn't be disheartened. Almost all the components used are available in Keras (except maybe the Cyclic LR scheduler, but there are implementations of it that you can use). Explaining PyTorch is out of the scope of this article.

The model which gave the best result (so far) has the following configuration (with a Cyclic LR scheduler, RMSprop optimizer and Poisson loss):

{'batch_size': 1000, 'input_size': 128, 'hidden_units_1': 200, 'hidden_units_2': 100, 'hidden_units_3': 50, 'do_1': 0.2, 'do_2': 0.1, 'do_3': 0.05, 'output_size': 1, 'lr': 0.001, 'min_lr': 1e-05, 'max_lr': 0.001, 'epochs': 500, 'lr_sched': 'clr', 'lr_sched_mode': 'triangular', 'gamma': 0.95}

It's a 5-layer (3 hidden layers) model with dropout, trained with a cyclic learning rate.
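A sketch of a network matching that configuration (the exact architecture is in the repo; the batch-norm placement here is my assumption, based on the improvements listed later, and `cfg` stands for the dict shown above):

```python
import torch.nn as nn

class DistanceNet(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cfg["input_size"], cfg["hidden_units_1"]),
            nn.BatchNorm1d(cfg["hidden_units_1"]), nn.ReLU(), nn.Dropout(cfg["do_1"]),
            nn.Linear(cfg["hidden_units_1"], cfg["hidden_units_2"]),
            nn.BatchNorm1d(cfg["hidden_units_2"]), nn.ReLU(), nn.Dropout(cfg["do_2"]),
            nn.Linear(cfg["hidden_units_2"], cfg["hidden_units_3"]),
            nn.BatchNorm1d(cfg["hidden_units_3"]), nn.ReLU(), nn.Dropout(cfg["do_3"]),
            nn.Linear(cfg["hidden_units_3"], cfg["output_size"]),   # single regression output
        )

    def forward(self, x):
        return self.net(x)
```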

You may not find this exact model in the GitHub code if I have tried other configurations since, and thus the parameters could have changed, but you can find the corresponding model and parameters in the run47 folder. Please note that this configuration is different from what is mentioned in the paper. As with almost every other paper I have read, they skip the low-level details of the NN.

Here is how you initialize Data Loaders for train/val/test data:
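Something along these lines (a sketch; the notebook also creates a validation loader the same way):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

def make_loader(X, y, batch_size=1000, shuffle=False):
    ds = TensorDataset(torch.from_numpy(X).float(),
                       torch.from_numpy(y).float().unsqueeze(1))
    return DataLoader(ds, batch_size=batch_size, shuffle=shuffle)

train_loader = make_loader(X_train_bal, y_train_bal, shuffle=True)
test_loader = make_loader(X_test, y_test)
```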

Now we are all set to train the model:
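Below is a stripped-down sketch of such a loop; the real notebook also does early stopping, checkpointing, and TensorBoard logging, and whether the model predicts the rate directly or its log (the `log_input` flag) is my assumption:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = DistanceNet(cfg).to(device)

criterion = torch.nn.PoissonNLLLoss(log_input=False)   # Poisson loss on the raw (rate) output
optimizer = torch.optim.RMSprop(model.parameters(), lr=cfg["lr"])
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=cfg["min_lr"], max_lr=cfg["max_lr"],
    mode=cfg["lr_sched_mode"], cycle_momentum=False)

for epoch in range(cfg["epochs"]):
    model.train()
    for xb, yb in train_loader:
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
        scheduler.step()                                # CyclicLR is stepped per batch
```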

If you have only worked with Keras/fastai so far, don't be scared; code to train a PyTorch model is not always this long. I am doing a lot of other stuff like early stopping, checkpoint saving, etc. It's definitely more verbose than Keras, but nothing you can't learn in a day or two. And as I mentioned before, all these components are already there in Keras (including Poisson loss), so you can easily try this out in the framework of your choice.

I trained the model for ~110 epochs.

Here are the results:

Poisson Loss = -0.11
MAE Loss = 0.32
MSE Loss = 0.17
Accuracy = 76.40%
vs
Baseline :
MAE Loss = 0.59
MSE Loss = 0.56
Accuracy = 50.57%

Although this is a regression problem, I have still recorded accuracy because I find it more intuitive (it's not the training metric, just something I used to compare models). It's not a result to boast about, but it's not bad either. Also, I think there is a lot of room for improvement, which I will mention later.

If you are wondering why longer path distances have worse accuracy, similar observations were made by the authors as well. Here's what they say:

Observe that the larger errors caused by longer paths utilizing node2vec embeddings. In one hand, we do not have enough samples for longer distances in the training set. On the other hand, node2vec fails to learn structural features of faraway nodes.
Results from the paper. Notice the larger error values for longer path distances.

Note: There is one thing that's bugging me about this model. Despite using exp_range as the CLR mode, the learning rate didn't change throughout training (at least according to the plot). This always happens with a gamma value of 0.95. I need to look into it. If you have faced this before and have a solution, you are welcome to share it here.

In retrospect, here are some of the things that improved the model performance:

  1. Poisson loss: I had tried MSE and MAE (used in the paper) earlier, but neither worked. All the ML communities suggested Poisson loss for count (integer) data, and it really worked.
  2. Batch normalization: The training loss improved really slowly before batch normalization; I was training for at least 1000 epochs before getting anywhere near these figures.
  3. Under/over sampling
  4. Cyclic LR scheduler
  5. StandardScaler instead of MinMaxScaler
  6. Changing the optimizer from SGD to RMSprop (SGD was used in the paper)

3. Further Improvements

There are a lot of things that could still be explored, which requires a lot of patience. Here is a non-exhaustive list:

  1. Better hyperparameters: especially the learning rate range and different network architectures.
  2. Try using Convolutional Neural Networks on this data. I am quite sure it would improve the results.
  3. Different node2vec embedding dimensions (we used 128).
  4. You may remember we used the "average" operator to combine the embeddings of both nodes. We could try other operators. In the paper, the authors observe that:
"binary operators do not have a consistent behavior over different datasets and different dimension sizes. For instance, the average operator outperforms others in Facebook graph while concatenation works better for Youtube dataset"
  5. Feature selection: We could try to reduce the number of features by selecting only the important ones (not done in the paper).
  6. Better sampling techniques, like KMeansSMOTE, SMOTE, Cluster Centroids, etc. I tried using these, but they took too long to finish.
  7. There is a large gap between training loss and validation loss, suggesting that we could reduce the overfitting to get a better result. Maybe try a smaller model and see if it is still capable of learning.

4. Conclusion

Well, this has already become a long post, so I will end it without much blabbering. This has been an interesting project with lots of cool concepts to learn. I couldn't accommodate many of them here, but I have recorded, and will keep recording, fun facts about this problem as and when I find them, in the fun.ipynb file in the project. Thanks for sticking around till the end. Happy coding!

