### Comprehending Convolutions on Graphs

This blog article by AICorespot deals with comprehending the building blocks and design choices of graph neural networks.

Several systems and interactions – social networks, molecules, organizations, citations, physical models, transactions – can be indicated very naturally as graphs. How can we reason with regards to making forecasts within these systems?

One notion is to take a peek at tools that have functioned well in other domains: neural networks have demonstrated massive forecasting power in an array of learning activities. However, neural networks have been conventionally leveraged to function on fixed-size and/or regular-structured inputs (like sentences, imagery and video). This makes them not capable of elegantly processing graph-structured data.

Graph neural networks (GNNs) are a grouping of neural networks that can function naturally on graph-structured data. Through extraction and utilization of features from the underlying graph, GNNs can make more informed forecasts about entities in these interactions , in contrast to models that consider individual entities in isolation.

GNNs are not the sole tools available to model-graph structured data: graph kernels and random-walk strategies were a few of the most widespread ones. Presently, however, GNNs have mostly supplanted these strategies due to their inherent flexibility to model the underlying systems better.

In this blog post, we will demonstrate the challenges of computing over graphs, detail the origin and design of graph neural networks, and explore the most widespread GNN variants as of late. Specifically, we will observe that several of these variants are made up of similar building blocks.

To start with, let’s talk about some of the complications that graphs carry with them.

**The Challenges of Computation on Graphs**

**Lack of Consistent Structure**

Graphs are really flexible mathematical models; but this implies they don’t have consistent structure across examples. Take up the activity of forecasting whether a provided chemical molecule is toxic.

Observing a few instances, the following problems become immediately apparent:

- Molecules might have differing numbers of atoms.
- The atoms within a molecule might be of differing types.
- Each of these atoms might have differing number of connections.
- These connections can possess differing strengths.

Representation of graphs in a format that can be computed over is non-trivial, and the final representation selected usually is dependent considerably on the actual problem.

**Node-Order Equivariance**

Extension to the point above: graphs usually possess no inherent ordering present amongst the nodes. In contrast to imagery, where every pixel is uniquely decided by its absolute position in the image.

As an outcome, we would like our algorithms to be node-order equivariant: they should not be dependent on the ordering of the nodes of the graph. If we undertake permutation of the nodes in some fashion, the outcome representations of the nodes as computed by our algorithms should also undergo permutation in the same manner.

**Scalability**

Graphs can be impressively large. Think about social networks such as Facebook or Twitter, which have more than one billion users. Working on data this massive is not simple.

Luckily, most naturally occurring graphs are ‘sparse’: they have a tendency to have their number of edges linear in their number of vertices. We will observe that this enables the leveraging of intelligent strategies to effectively compute representations of nodes in the graph. Further, the strategies that we will observe here will have considerably lesser parameters in contrast to the size of the graphs they operate on.

**Problem Setting and Notation**

There are several useful issues that can be formulated over graphs:

- Node Classification: Classification of individual roles.
- Graph Classification: Classification of entire graphs.
- Node Clustering: Grouping together similar nodes on the basis of connectivity.
- Link forecasting: Forecasting missing links.
- Influence Maximization: Identification of influential nodes.

A typical precursor in identifying a solution to several of these issues is node representation learning: learning to map individual nodes to fixed-size real-valued vectors (referred to as ‘representations’ or ‘embeddings’)

In Learning GNN parameters, we will observe how the learnt embeddings can be leveraged for these activities.

Differing GNN variants are distinguished by the manner in which these representations are computed. Typically, however, GNNs compute node representations in an iterative procedure. We will leverage the notation hv^{(k) }to signify the representation of node v following the kth iteration. Every iteration can be perceived of as the equivalent of a ‘layer’ in traditional neural networks.

We will go about defining a graph G as a grouping of nodes, V, with a grouping of edges E connecting them. Nodes can possess separate features as portion of the input: we will denote by x_{v }the individual feature for node v ∈*V. *For instance, the ‘node features’ for a pixel within a colour image would be the red, green and blue channel (RGB) values at that pixel.

For simplicity of exposition, we will assume *G *is undirected, and all nodes are of the same variant. These variants of graphs are referred to as ‘homogenous’. Several of the same ideas we will observe here are applicable to other variants of graphs: this will be detailed later in a separate blog.

At times we will be required to denote a graph attribute by a matrix M, where every row M_{v} signifies an attribute corresponding to a specific vertex v.

**Extensions of Convolutions to Graphs**

Convolutional Neural Networks have been observed to be very potent in extraction of features from images. However, images themselves can be observed as graphs with a very traditional grid-like structure, where the individual pixels are nodes, and the RGB channel values at every pixel as the node features.

A natural idea, then, is to take up generalizing convolutions to arbitrary graphs. Remember, however, the challenges specified are detailed in the prior section: In specific, ordinary convolutions are not node-order invariant, as they are dependent on the absolute positions of pixels. It is at first, not obvious as how to generalize convolutions over grids to convolutions over general graphs, where the neighbourhood structure is different from node to node. The curious reader might wonder if carrying out some kind of padding and ordering could be done to make sure that the consistency of neighbourhood structure across nodes. Efforts have been made with varying degrees of success, but the strategies we observe here are more general and potent.

We start by putting forth the notion of constructing polynomial filters over node neighbourhoods, a lot like how CNNs compute localized filters over neighbouring pixels. Then, we will observe how more updated approaches extend on this idea with more potent mechanisms. Lastly, we will speak about alternative strategies that can leverage ‘global’ graph-level data for computation of node representations.

**Polynomial Filters on Graphs**

**The Graph Laplacian**

Provided a graph *G, *let us rectify a random ordering of the n nodes of *G.* We signify the 0 – 1 adjacency matrix of *G *by *A*, we can build the diagonal degree matrix *D *of *G *as:

Where *A _{uu} *signifies the entry in the row correlating to

*v*and the column corresponding to

*u*in the matrix

*A*. We will leverage this notation throughout this section.

Then, the graph Laplacian L is the square n x n matrix defined as L = D – A

The graph Laplacian traces its etymology from being the discrete analog of the Laplacian operator from Calculus.

Even though it encodes precisely the same data as the adjacency matrix *A *in the sense that provided either of the matrices *A *or *L*, you can build the other., the graph Laplacian has several fascinating attributes of its own. The graph Laplacian shows up in several mathematical problems consisting graphs: random walks, spectral clustering, and diffusion, to specify a few. We will observe a few of these attributes in a later portion of the blog, but will rather point readers to tutorials that are available online.

**Polynomials of the Laplacian**

Now that we have comprehended what the graph Laplacian is, we can construct polynomials of the form:

*P _{w}(L) = w_{0}I_{n} + w_{1}L + w_{2}L^{2} + … + w_{d}L^{d }= *

Every polynomial of this form can alternatively be signified by its vector of coefficients *w = [w _{0, }*……..,

*w*Observe that for every

_{d}]*w, p*

_{w}(L)_{}is an n x n matrix, much like L.

These polynomials can be perceived of as the equivalent of ‘filters’ in CNNs, and the coefficients *w *as the weights of the ‘filters’

For simplicity of exposition, we will concentrate on the case where nodes possess one-dimensional features: each of the x_{v} for v ∈*V *is only a real number. The same notions hold when each of the x_{v}, are higher-dimensional vectors, also.

Leveraging the prior selected ordering of the nodes, we can stack all of the node features x_{v} to obtain a vector x ∈R*n*

After we have built the feature vector x, we can go about defining its convolution with a polynomial filter p_{w} as:

X^{’} = p_{w}(L)x

To comprehend how the coefficients w impact the convolution, let us start by taking up the ‘simplest’ polynomial. When w_{0 }= 1 and all of the other coefficients are nil. In this scenario, x^{’} is just x:

Presently, if we amp up the degree, and consider the case where rather w_{1} = 1 and all of the other coefficients are nil. Then x^{’ }is just Lx and so:

We observe that the features at every node v are brought together with the features of its immediate neighbours u ∈N(*v*). For readers acquainted with Laplacian filtration of imagery, this is precisely the same idea. When x is an image, x^{’} = L_{x }is precisely the outcome of applying a ‘Laplacian filter’ to x.

At this juncture, a natural question that will prop up is: How does the degree d of the polynomial influence the behaviours of the convolution? Indeed, it is not too difficult to demonstrate that: This is Lemma 5.2 from

Dist_{G}(v, u) > I →

This carries with it the implication that when we convolve x with p_{w}(L) of degree d to get x^{’}:

Essentially, the convolution at node v happens just with nodes u which are not more than d hops away. Therefore, these polynomial filters are localized. The degree of the localization is governed totally by d.

To assist you in comprehending these ‘polynomial-based’ convolutions in a better way, we have developed the visualization below. Vary the polynomial coefficients and the input grid x to observe how the outcome x^{’} of the convolution modifications. The grid under the arrow displays the equivalent convolutional kernel applied at the highlighted pixel in x to obtain the outcome pixel in x^{’}. The kernel correlates to the row of p_{w}(L) for the highlighted pixel. Observe that even upon alterations for position, this kernel is differing for differing pixels, dependent on their position within the grid.

**ChebNet**

ChebNet refines this notion of polynomial filters by observing polynomial filters of the form:

Where T_{i} is the degree-i Chebyshev polynomial of the first kind and *L *is the normalized Laplacian defined leveraging the biggest eigenvalue of L: we speak about the eigenvalues of the Laplacian L in additional detail in an upcoming section.

Where T_{i} is the degree-i Chebyshev polynomial of the first kind and L is the normalized Laplacian defined leveraging the biggest eigenvalue of L: We speak about the eigenvalues of the Laplacian L in comprehensive detail in an upcoming section.

What is the motivation underlying these choices?

- L is actually positive semi-definite: all of the eigenvalues of L are not smaller than 0. If
*λ*max(*L*)>1, the entries in the powers of L swiftly grow in size. L is essentially a scaled down variant of L, with eigenvalues ensured to be in the range [-1,1]. This averts the entries of powers of L from blowing up. Indeed, in the visualization above: we limit the higher-order coefficients when the unnormalized Laplacian L is chosen, but facilitate bigger values when the normalized Laplacian L is chosen, in order the demonstrate the same outcome x^{’ }on the same color scale. - The Chebyshev polynomials have specific fascinating attributes that make interpolation more numerically stable. We won’t speak about this in extra depth here, but will advise fascinated readers to have a peek at as a definitive resource.

**Polynomial Filters are Node-Order Equivariant**

The polynomial filters we took into account here are actually independent of the ordering of the nodes. This is specifically simple to observe when the degree of the polynomial *p _{w} *is 1: where each node’s feature is aggregated with the total of it’s neighbour’s features. Obviously, this sum is not dependent on the order of the neighbours. A similar proof follows for higher degree polynomials: the entries in the powers of L are equivariant to the ordering of the nodes.

**Info for the ones whose curiosity is piqued**

As above, let’s go by the assumption that an arbitrary node-order over the n nodes of our graph. Any other node-order can be perceived of as a permutation of this original code-order. We can signify any permutation through a permutation matrix P. P will always be an orthogonal 0 -1 matrix:

*PP ^{T} = P^{T }P = I_{n}*

Then, we refer to an algorithm f node-order equivariant iff for all permutations P:

*f(Px) = P f (x)*

When swapping to the new node-order leveraging the permutation P, the quantities below transform in the subsequent way:

*x →** P _{x}*

*L →** PLP ^{T}*

*L ^{i} →*

*PL*

^{i}P^{T}and so, for the scenario of polynomial filters, we can observe that:

As required.

**Embedding Computation**

We can now detail how we can construct a graph neural network by stacking ChebNet (or any polynomial filter) layers one upon the other with non-linearities, much like a traditional CNN. Specifically, if we have K differing polynomial filter layers, the kth of which has its own learnable weights w^{(k)} we would carry out the following computation:

Observe that these networks re-leverage the same filter weights across differing nodes, precisely mimicking weight-sharing in Convolutional Neural Networks (CNNs) which reuse weights for convolutional filters across a grid.

**Modern Graph Neural Networks**

ChebNet was a breakthrough in learning localized filters over graphs, and it inspired several to contemplate about graph convolutions from a differing viewpoint.

We return back to the outcome of convolving x by the polynomial kernel p_{w}(L) = L, concentrating on a specific vertex v:

As we specified prior, this is a 1-hop localized convolution. However, more critically, we can think of this convolution as originating out of dual steps:

- Aggregating over immediate neighbour features x
_{u} - Combining with the node’s own feature x
_{v}

Critical idea: What is take up differing variants of ‘aggregation’ and ‘combination’ steps, beyond what is feasible leveraging polynomial filters?

By making sure that the aggregation is node-order equivariant, the cumulative convolution becomes node-order equivariant.

These convolutions can be perceived of as ‘message-passing’ amongst adjacent nodes: following each step, each node receives some data from its neighbours.

Through iteratively repeating the 1-hop localized convolutions K times (that is, repeatedly ‘passing messages’), the receptive field of the convolution basically integrates all nodes upto K hops away.

**Embedding Computation**

Message-passing forms the spine of several GNN architectures today. We describe the most widespread ones in depth below:

- Graph Convolutional Networks (GCN)
- Graph Attention Networks (GAT)
- Graph Sample and Aggregate (GraphSAGE)
- Graph Isomorphism Network (GIN)

**Reflections**

A fascinating point is to evaluate differing aggregation functions: are some superior options and others worse off? This illustrates that aggregation functions as a matter of fact can be contrasted on how well they can uniquely maintain node neighbourhood features; we suggest that the fascinated reader take a look at the comprehensive theoretical analysis there.

Here, we speak about GNNs where the computation only happens at the nodes. More latest GNN models like Message-Passing Neural Networks and Graph Networks carry out computation over the edges additionally; they compute edge embeddings together with node embeddings. This is an even more generalized framework – but the same ‘message passing’ ideas from this portion of the blog are applicable.

**Interactive Graph Neural Networks**

Practically, every iteration above is typically perceived of as a singular ‘neural network layer’. This ideology is adhered to by several popular Graph Neural Network libraries, for instance: PyTorch Geometric and StellarGraph enabling one to compose differing variants of graph convolutions in the same model.

**From Local to Global Convolutions**

The strategies we’ve observed thus far carries out ‘local’ convolutions: each node’s feature is updated leveraging a function of its local neighbour’s features.

While carrying out enough steps of message-passing will ultimately make sure that data from all nodes in the graph is passed, one might wonder if there are more direct ways to carry out ‘global’ convolutions

The answer is in the positive, we will now detail a strategy that was actually first put forth in the context of neural networks by, much before any of the GNN models we looked at above.

**Spectral Convolutions**

As prior, we will concentrate on the case where nodes have 1D features. After selecting an arbitrary node-order, we can stack all of the node features to obtain a feature vector *x*∈R*n*.

Critical idea: Provided a feature vector x, the Laplacian L facilitates us to quantify how smooth x is, with regards to G.

How?

Upon normalization of x such that ∑*i*=1*n**x**i*2=1, if we look at the following quantity consisting of L:R_{L }is formally referred to as the Rayleigh quotient.

we immediately observe that feature vectors x that allocate similar values to adjacent nodes in G (therefore, are smooth) would have lesser values of R_{L}(x)

L is a real, symmetric matrix, which implies it has all real eigenvalues *λ*1≤…≤*λ**n* An Eigenvalue *λ* is a value satisfying the equation Au = *λu* for a specific vector u, referred to as an eigenvector. Further, the corresponding eigenvectors u_{1}, …., u_{n} can be taken to be orthonormal:

It happens that these eigenvectors of L are successively less smooth, as R_{L} signifies: This is the min-max theorem for eigenvalues.

The grouping of eigenvalues of L are referred to as ‘spectrum’, therefore the name! We denote the ‘spectral’ decomposition of L as:

Where Λ is the diagonal matrix of sorted eigenvalues, and U signifies the matrix of the eigenvectors (sorted according to increasing eigenvalues):

The orthonormality condition amongst eigenvectors provides us that U^{T} U = I, the identity matrix. As these n eigenvectors for a foundation for R^{n}, any feature vector x can be signified as a linear combo of these eigenvectors:

Where x is the vector of coefficients [x_{0, }… x_{n}]. We call x as the spectral representation of the feature vector x. The orthonormality condition facilitates us to specify:

This pairing of equations enables us to interconvert between the ‘natural’ representation x and the ‘spectral’ representation x for any vector *x*∈R*n*

**Spectral Representations of Natural Images**

As detailed prior, we can consider any imagery as a grid graph, where every pixel is a node, linked by edges to adjacent pixels. Therefore, a pixel can have either 3, 5, or 8 neighbours, dependent on its location within the image grid. Every pixel gets a value as part of the image. If the image is grayscale, every value will be a singular real number signifying how dark the pixel is. If the image is coloured, every value will be a 3D vector, signifying the values for the green, red, and blue (RGB) channels. We leverage the alpha channel also in the visualization below, so this is in actuality RGBA.

This construction facilitates us to compute the graph Laplacian and the eigenvector matrix *U*. Provided an image, we can then investigate what its spectral representation appears like.

To shed some light on what the spectral representation actually encodes, we carry out the following experiment over every channel of the image independently:

- We initially gather all pixel value across a channel into a feature vector x.
- Then, we obtain its spectral representation x

- We truncate this to the first m components to get x
_{m}. Through truncation, we imply zeroing out all over the remaining n – m components of x. This truncation is equivalent to leveraging just the first m eigenvectors to compute the spectral representation.

- Then, we convert this truncated representation x
_{m }back to the natural basis to get x_{m}.

X_{m} = Ux_{m}

Lastly, we stack the outcome channels back together to get back an image. We can now observe how the outcome image modifies with choices of m. Observe that when m = n, the outcome image is identical to the original image, as we can reconstruct every channel precisely.

As m reduces, we can observe the output image x_{m }gets blurrier. If we reduce m to 1, the output image x_{m} is completely the same colour throughout. We observe that we do not require to retain all *n *components: we can retain a lot of the data within the image with considerably fewer components. We can relate this to the Fourier decomposition of images: the more eigenvectors we leverage, the higher frequencies we can signify on the grid.

To complement the visualization above, we also visualize the first few eigenvectors on a smaller 8 x 8 grid below. We modify the coefficients of the first 10 out of 64 eigenvectors in the spectral representation and observe how the outcome image alters:

Such visualizations should provide you with compelling proof that the initial eigenvectors are as a matter of fact, smooth, and the smoothness correspondingly reduces as we take up later eigenvectors.

For any image x, we can perceive of the initial entities of the spectral representation x as capturing ‘global’ image-wide trends, which are the low-frequency components, while the later entries as capturing ‘local’ details, which are the high-frequency components.

**Embedding Computation**

We now have the foundation to comprehend spectral convolutions and how they can be leveraged to compute embeddings/feature representations of nodes.

As prior, the model we detail below possesses K layers: every layer k has learnable parameters w^{(k)} referred to as the ‘filter weights’. These weights will be convolved with the spectral representations of the node features. As an outcome, the number of weights required in every layer is equivalent to *m*, the number of eigenvectors leveraged to compute the spectral representations. We had demonstrated in the prior section that we can take m << n and still not lose out on considerable amounts of data.

Therefore, convolution in the spectral domain facilitates the utilization of considerably lesser parameters that only direct convolution in the natural domain. Further, through virtue of the smoothness of the Laplacian eigenvectors across the graph, leveraging spectral representations automatically enforces an inductive bias for neighbouring nodes to obtain similar representations.

Going by the assumption that 1D node features for now, the output of every layer is a vector of node representations h^{(k)}, where every node’s representation corresponds to a row of the vector.

We rectify an ordering of the nodes in *G*. This provides us the adjacency matrix *A *and the graph Laplacian *L, *facilitating us to compute *U*_{m }Lastly, we can detail the computation that the layers carry out, one after the other:

The strategy above generalizes easily to the case where each *h ^{(k)} * ∈R

*d*

*k* also: see for details.

With the insight we obtained from the prior section, we observe that convolution in the spectral-domain of graphs can be perceived of as the generalization of convolution in the frequency-domain of images.

**Spectral Convolutions are Node-Order Equivariant**

We can display spectral convolutions are node-order equivariant leveraging a similar strategy as for Laplacian polynomial filters.

As with our proof prior, let’s rectify an arbitrary node-order. Then, any other node-order can be signified by a permutation of this original node-order. We can associate this permutation with its permutation matrix, P. With this fresh node-order, the quantities below transform in the following way:

*x* → *Px*

A → *PAP**T*

*L →** PLP ^{T}*

*U _{m} →*

*P U*

_{m}Which carries with it the implication that, in the embedding computation:

*x*^ → (*PU**m*)*T*(*Px*)=*U**mT**x*=*x*^

*w*^ → (*PU**m*)*T*(*Pw*)=*U**mT**w*=*w*^

g^ → g^

g → (P U_{m}) g^ = P (U_{m}g) = P g

Therefore, as *σ* is applied by element.

*f(P x) = **σ(P g) = P σ **(g) = P f(x)*

as needed. Further, we observe that the spectral quantities x^, w^, and g^ are unmodified by permutations of the nodes. Formally, the are what we would call node-order invariant.

The theory of spectral convolutions is mathematically well-grounded; but, there are some critical drawbacks that we must speak about:

- We are required to compute the eigenvetor matrix U
_{m}from*L*. For big graphs, this becomes very infeasible. - Even if we can compute
*U*global convolutions themselves are ineffective to compute, due to the repeated multiplications with_{m},*U*_{m}and*U**mT*. - The learned filters are particular to the input graphs, as they are signified in terms of the spectral decomposition of input graph Laplacian L. This implies they do not transfer well to fresh graphs which have considerably differing structure (and therefore, considerably differing eigenvalues)

While spectral convolutions have mostly been supplanted by ‘local’ convolutions for the reasons detailed above, there is still much advantage to comprehending the ideas underlying them. Indeed, a recently proposed GNN model referred to as Directional Graph Network actually leverages the Laplacian eigenvectors and their mathematical attributes comprehensively.

**Global Propagation through Graph Embeddings **An easier way to integrate graph-level data is to compute embeddings of the complete graph by pooling node (and potentially edge) embeddings, and then leveraging the graph embedding to go about updating node embeddings, followed by an interative scheme a lot like what we’ve observed here. This is a strategy leveraged by Graph Networks. We will briefly speak about how graph-level embeddings can be built in Pooling. But, such strategies have a tendency to ignore the fundamental topology of the graph that spectral convolutions can capture.

**Learning GNN Parameters**

All of the embedding computations we’ve detailed here, whether they’re spatial or spectral, are totally differentiable. This enables GNNs to be trained in an end-to-end fashion, just like a traditional neural network, once an appropriate loss function L is defined:

- Node classification: By reducing any of the traditional losses for classification activities, like categorical cross-entropy when several classes are present:

L(*y**v*,*y**v*^)=−*c*∑*y**vc*log*y**vc*^.

Where *y**vc*^ is the forecasted probability that node v is in class c. GNNs adapt good to the semi-supervised setting, which is when just some nodes within the graph are labelled. In this setting, one method to define a loss *L _{G }*over an input graph

*G*is:

Where, we only compute losses over labelled nodes Lab(G)

- Graph classification: Through aggregation of node representations, one can build a vector representation of the entire graph. This graph representation can be leveraged for any graph-level activity, even beyond classification. See Pooling for how representations of graphs can be built.
- Link forecasting: Through sampling pairs of adjacent and non-adjacent nodes, and leverage these vector pairs as inputs to forecast the existence/absence of an edge. For a robust instance, by reducing the following ‘logistic regression’-like loss:

L(*y**v*,*y**u*,*e**vu*)*p**vu*=−*e**vu*log(*p**vu*)−(1−*e**vu*)log(1−*p**vu*)=*σ*(*y**vT**y**u*)

Where *σ *is the sigmoid function, and e_{vu} = 1 if there is an edge between nodes v and u, being 0 otherwise.

- Node clustering: By merely clustering the learned node representations.

The wide success of pre-training for natural language processing models like ELMo and BERT has sparked fascination in similar strategies for GNNs. The critical idea in each of these papers is to train GNNs to forecast local (for example node degrees, clustering coefficient, masked node attributes) and/or global graph properties. (for example pairwise distances, masked global attributes)

Another self-supervised strategy is to enforce that neighbouring nodes get similar embeddings, emulating random-walk strategies like node2vec and DeepWalk:

Where *N _{R}(v) *is a multi-set of nodes visited when arbitrary walks are commenced from v. For large graphs, where computing the total over all nodes might be computationally expensive, strategies like Noise Contrastive Estimation are particularly useful.

**Conclusion/Further Reading**

While we have observed several strategies and ideas in this blog post, the domain of Graph Neural Networks is really vast. We have been forced to limit our discussion to a minimal subset of the total literature, while still communicating the critical ideas and design principles behind GNNs. We suggest that the fascinated reader take a look at for a more comprehensive survey.

We conclude with pointers and references for extra concepts readers might be fascinated in:

**GNNs in Practice**

It happens that accommodating the differing structures of graphs is usually difficult to do efficiently, however we can still represent several GNN update equations using as sparse matrix-vector products (as generally, the adjacency matrix is sparse for most real-world graph datasets.) For instance, the GCN variant detailed here can be signified as:

Restructuring the update equations in this fashion facilitates for effective vectorized implementations of GNNs on accelerators like GPUs.

Regularization strategies for the traditional neural networks, like Dropout, can be applied in a straightforward fashion to the parameters (for instance, zero out entire rows of *W ^{(k)} *above). However, there are graph-particular strategies like DropEdge that removes complete edges arbitrarily from the graph, that also enhance the performance of several GNN models.

**Differing Variants of Graphs**

Here, we have concentrated on undirected graphs, to prevent getting into too many unneeded details. But, there are some simple variants of spatial convolutions for:

- Directed graphs: Aggregate across in neighborhood and/or out-neighbourhood features.
- Temporal graphs: Aggregate across prior and/or future node features.
- Heterogenous graphs: Learn differing aggregation functions for every node/edge type.

There do exist more advanced strategies that can reap benefits of the differing structures of these graphs: see for more discussion.

**Pooling**

This article spoke about how GNNs compute beneficial representations of nodes. However, what if we wished to compute representations of graphs for graph-level tasks (for instance, forecasting the toxicity of a molecule)?

A simple solution is to merely aggregate the final node embeddings and pass them through another neural network PREDICT G:

h_{G = }PREDICT_{G} (AGG* v*∈*G* ({*h**v*}))

But, there do exist some potent strategies for ‘pooling’ together node representations:

- SortPool: Sort vertices of the graph to obtain a fixed-size node-order invariant representation of the graph, and then apply any traditional neural network architecture.
- DiffPool: Learn to cluster vertices, develop a coarser graph over clusters rather than nodes, than apply a GNN over the coarser graph. Repeat this process till only a single cluster remains.
- SAGPool: Apply a GNN to learn node scores, then keep just the nodes with the top scores, throwing away the rest. Repeat till only a single node remains.