The More You Know: Using Knowledge Graphs for Image Classification

Kenneth Marino, Ruslan Salakhutdinov, Abhinav Gupta
Carnegie Mellon University
5000 Forbes Ave, Pittsburgh, PA 15213
{kdmarino, rsalakhu,

Abstract

One characteristic that sets humans apart from modern learning-based computer vision algorithms is the ability to acquire knowledge about the world and use that knowledge to reason about the visual world. Humans can learn about the characteristics of objects and the relationships that occur between them to learn a large variety of visual concepts, often with few examples. This paper investigates the use of structured prior knowledge in the form of knowledge graphs and shows that using this knowledge improves performance on image classification. We build on recent work on end-to-end learning on graphs, introducing the Graph Search Neural Network as a way of efficiently incorporating large knowledge graphs into a vision classification pipeline. We show in a number of experiments that our method outperforms standard neural network baselines for multi-label classification.

1 Introduction

Our world contains millions of visual concepts understood by humans. These often are ambiguous (tomatoes can be red or green), overlap (vehicles includes both cars and planes) and have dozens or hundreds of subcategories (thousands of specific kinds of insects). While some visual concepts are very common such as person or car, most categories have many fewer examples, forming a long-tail distribution[37]. And yet, even when only shown a few or even one example, humans have the remarkable ability to recognize these categories with high accuracy. In contrast, while modern learning-based approaches can recognize some categories with high accuracy, they usually require thousands of labeled examples for each of these categories. Given how large, complex and dynamic the space of visual concepts is, this approach of building large datasets for every concept is unscalable. Therefore, we need to ask what humans have that current approaches do not.

One possible answer to this is structured knowledge and reasoning. Humans are not merely appearance-based classifiers; we gain knowledge of the world from experience and language. We use this knowledge in our everyday lives to recognize objects. For instance, we might have read in a book about the "elephant shrew" (maybe even seen an example) and will have gained knowledge that is useful for recognizing one. Figure 1 illustrates how we might use our knowledge about the world in this problem. We might know that an elephant shrew looks like a mouse, has a trunk and a tail, is native to Africa, and is often found in bushes. With this information, we could probably identify the elephant shrew if we saw one in the wild. We do this by first recognizing (we see a small mouse-like object with a trunk in a bush), recalling knowledge (we think of animals we have heard of and their parts, habitat, and characteristics) and then reasoning (it is an elephant shrew because it has a trunk and a tail, and looks like a mouse while mice and elephants do not have all these characteristics). With this information, even if we have only seen one or two pictures of this animal, we would be able to classify it.

Figure 1: Example of how semantic knowledge about the world aids classification. Here we see an elephant shrew. Humans are able to make the correct classification based on what we know about the elephant shrew and other similar animals.

There has been a lot of work in end-to-end learning on graphs or neural networks trained on graphs[31, 2, 6, 11, 25, 22, 9, 21]. Most of these approaches either extract features from the graph or they learn a propagation model that transfers evidence between nodes conditional on the type of edge. An example of this is the Gated Graph Neural Network[18] which takes an arbitrary graph as input. Given some initialization specific to the task, it learns how to propagate information and predict the output for every node in the graph. This approach has been shown to solve basic logical tasks as well as program verification.

Our work improves on this model and adapts end-to-end graph neural networks to multi-label image classification. We introduce the Graph Search Neural Network (GSNN) which uses features from the image to efficiently annotate the graph, select a relevant subset of the input graph and predict outputs on nodes representing visual concepts. These output states are then used to classify the objects in the image. GSNN learns a propagation model which reasons about different types of relationships and concepts to produce outputs on the nodes which are then used for image classification. Our new architecture mitigates the computational issues with the Gated Graph Neural Networks for large graphs, which allows our model to be efficiently trained for image tasks using large knowledge graphs. We show how our model is effective at reasoning about concepts to improve image classification tasks. Importantly, our GSNN model is also able to provide explanations on classifications by following how the information is propagated in the graph.

The major contributions of this work are (a) the introduction of the GSNN as a way of incorporating potentially large knowledge graphs into an end-to-end learning system that is computationally feasible for large graphs; (b) a framework for using noisy knowledge graphs for image classification; and (c) the ability to explain our image classifications by using the propagation model. Our method significantly outperforms baselines for multi-label classification.

2 Related Work

Learning knowledge graphs[4, 3, 30] and using graphs for visual reasoning[37, 20] has recently been of interest to the vision community. For reasoning on graphs, several approaches have been studied. For example, [38] collects a knowledge base and then queries this knowledge base to do first-order probabilistic reasoning to predict affordances. [20] builds a graph of exemplars for different categories and uses the spatial relationships to perform contextual reasoning. Approaches such as [17] use random walks on the graphs to learn patterns of edges while performing the walk and predict new edges in the knowledge graph. There has also been some work using a knowledge base for image retrieval[12] or answering visual queries[39], but these works are focused on building and then querying knowledge bases rather than using existing knowledge bases as side information for some vision task.

However, none of these approaches have been learned in an end-to-end manner and the propagation model on the graph is mostly hand-crafted. More recently, learning from knowledge graphs using neural networks and other end-to-end learning systems to perform reasoning has become an active area of research. Several works treat graphs as a special case of a convolutional input where, instead of pixel inputs connected to pixels in a grid, we define the inputs as connected by an input graph, relying on either some global graph structure or doing some sort of pre-processing on graph edges[2, 6, 11, 25]. However, most of these approaches have been tried on smaller, cleaner graphs such as molecular datasets. In vision problems, these graphs encode contextual and common-sense relationships and are significantly larger and noisier.

Li and Zemel present Graph Gated Neural Networks (GGNN)[18] which use neural networks on graph structured data. This paper (an extension of Graph Neural Networks[31]) serves as the foundation for our Graph Search Neural Network (GSNN). Several papers have found success using variants of Graph Neural Networks applied to various simple domains such as quantitative structure-property relationship (QSPR) analysis in chemistry[22] and subgraph matching and other graph problems on toy datasets[9]. GGNN is a fully end-to-end network that takes as input a directed graph and outputs either a classification over the entire graph or an output for each node. For example, for the problem of graph reachability, GGNN is given a graph, a start node and end node, and the GGNN will have to output whether the end node is reachable from the start node. They show results for logical tasks on graphs and more complex tasks such as program verification.

There is also a substantial amount of work on various types of kernels defined for graphs[36] such as diffusion kernels[14], graphlet kernels[33], Weisfeiler-Lehman graph kernels[32], deep graph kernels[27], graph invariant kernels[26] and shortest-path kernels[1]. The methods have various ways of exploiting common graph structures; however, these approaches are only helpful for kernel-based approaches such as SVMs which do not compare well with neural network architectures in vision.

Our work is also related to attribute approaches[8] to vision such as [16] which uses a fixed set of binary attributes to do zero-shot prediction, [34] which uses attributes shared across categories to prevent semantic drift in semi-supervised learning and [5] which automatically discovers attributes and uses them for fine-grained classification. Our work also uses attribute relationships that appear in our knowledge graphs, but also uses relationships between objects and reasons directly on graphs rather than using object-attribute pairs directly.

3 Methodology

3.1 Graph Gated Neural Network

The idea of GGNN is that given a graph with $N$ nodes, we want to produce some output which can either be an output for every graph node $o_1, o_2, \ldots, o_N$ or a global output $o_G$. This is done by learning a propagation model similar to an LSTM. For each node $v$ in the graph, we have a hidden state representation $h_v^{(t)}$ at every time step $t$. We start at the first time step with initial hidden states $x_v$ that depend on the problem. For example, for learning graph reachability, this might be a two bit vector that indicates whether a node is the source or destination node. In the case of visual knowledge graph reasoning, $x_v$ can be a one bit activation representing the confidence of a category being present based on an object detector or classifier.

Next, we use the structure of our graph, encoded in a matrix $A$ which serves to retrieve the hidden states of adjacent nodes based on the edge types between them. The hidden states are then updated by a gated update module similar to an LSTM. The basic recurrence for this propagation network is

$$h_v^{(1)} = [x_v^T, 0]^T \tag{1}$$
$$a_v^{(t)} = A_v^T [h_1^{(t-1)} \ldots h_N^{(t-1)}]^T + b \tag{2}$$
$$z_v^{(t)} = \sigma(W^z a_v^{(t)} + U^z h_v^{(t-1)}) \tag{3}$$
$$r_v^{(t)} = \sigma(W^r a_v^{(t)} + U^r h_v^{(t-1)}) \tag{4}$$
$$\tilde{h}_v^{(t)} = \tanh(W a_v^{(t)} + U (r_v^{(t)} \odot h_v^{(t-1)})) \tag{5}$$
$$h_v^{(t)} = (1 - z_v^{(t)}) \odot h_v^{(t-1)} + z_v^{(t)} \odot \tilde{h}_v^{(t)} \tag{6}$$

where $h_v^{(t)}$ is the hidden state for node $v$ at time step $t$, $x_v$ is the problem specific annotation, $A_v$ is the adjacency matrix of the graph for node $v$, and $W$ and $U$ are learned parameters. Eq. 1 is the initialization of the hidden state with $x_v$ and empty dimensions. Eq. 2 shows the propagation updates from adjacent nodes. Eqs. 3-6 combine the information from adjacent nodes and the current hidden state of the nodes to compute the next hidden state.

After $T$ time steps, we have our final hidden states. The node level outputs can then just be computed as

$$o_v = g(h_v^{(T)}, x_v)$$

where $g$ is a fully connected network, the output network, and $x_v$ is the original annotation for the node.
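
The gated propagation described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: it assumes a single edge type (so one weight matrix `A` stands in for the per-edge-type adjacency), and the names `init_hidden`, `ggnn_step` and the parameter keys (`Wz`, `Uz`, `Wr`, `Ur`, `W`, `U`, `b`) are hypothetical.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_hidden(annotations, hidden_size):
    """Pad each node annotation with zeros up to the hidden size."""
    return [list(x) + [0.0] * (hidden_size - len(x)) for x in annotations]

def ggnn_step(h, A, p):
    """One gated propagation step over every node.

    h: list of hidden-state vectors, one per node.
    A: N x N weights, A[u][v] != 0 means node u sends to node v
       (assumption: a single edge type).
    p: dict of D x D matrices Wz, Uz, Wr, Ur, W, U and bias vector b.
    """
    n, d = len(h), len(h[0])
    new_h = []
    for v in range(n):
        # Aggregate incoming neighbour states, plus bias.
        a = [sum(A[u][v] * h[u][k] for u in range(n)) + p["b"][k]
             for k in range(d)]
        # Update gate z and reset gate r.
        z = [sigmoid(s + t) for s, t in
             zip(matvec(p["Wz"], a), matvec(p["Uz"], h[v]))]
        r = [sigmoid(s + t) for s, t in
             zip(matvec(p["Wr"], a), matvec(p["Ur"], h[v]))]
        # Candidate state computed from messages and the reset-gated state.
        gated = [ri * hi for ri, hi in zip(r, h[v])]
        cand = [math.tanh(s + t) for s, t in
                zip(matvec(p["W"], a), matvec(p["U"], gated))]
        # Interpolate between the old state and the candidate.
        new_h.append([(1 - zi) * hi + zi * ci
                      for zi, hi, ci in zip(z, h[v], cand)])
    return new_h

# Toy usage: two nodes, one edge 0 -> 1, hidden size 2.
zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {"Wz": zeros, "Uz": zeros, "Wr": zeros, "Ur": zeros,
          "W": [[1.0, 0.0], [0.0, 1.0]], "U": zeros, "b": [0.0, 0.0]}
A = [[0.0, 1.0], [0.0, 0.0]]
h0 = init_hidden([[1.0], [0.0]], 2)   # node 0 "detected", node 1 not
h1 = ggnn_step(h0, A, params)
```

In this toy setting the undetected node 1 picks up a non-zero state purely from its neighbour, which is the behaviour the propagation model relies on.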

3.2 Graph Search Neural Network

The biggest problem in adapting GGNN for image tasks is computational scalability. NEIL[4] for example has over 2000 concepts, and NELL[3] has over 2M confident beliefs. Even after pruning to our task, these graphs would still be huge. Forward propagation on the standard GGNN is to the number of nodes and backward propagation is where is the number of propagation steps. We perform simple experiments on GGNNs on synthetic graphs and find that after more than about 500 nodes, a forward and backward pass takes over 1 second on a single instance, even when making generous parameter assumptions. On 2,000 nodes, it takes well over a minute for a single image. Using GGNN out of the box is infeasible.

Our solution to this problem is the Graph Search Neural Network (GSNN). As the name might imply, the idea is that rather than performing our recurrent update over all of the nodes of the graph at once, we start with some initial nodes based on our input and only choose to expand nodes which are useful for the final output. Thus, we only compute the update steps over a subset of the graph. So how do we select which subset of nodes to initialize the graph with? During training and testing, we determine initial nodes in the graph based on likelihood of the concept being present as determined by an object detector or classifier. For our experiments, we use Faster R-CNN[28] for each of the 80 COCO categories. For scores over some chosen threshold, we choose the corresponding nodes in the graph as our initial set of active nodes.

Once we have initial nodes, we also add the nodes adjacent to the initial nodes to the active set. Given our initial nodes, we want to first propagate the beliefs about our initial nodes to all of the adjacent nodes. After the first time step, however, we need a way of deciding which nodes to expand next. We therefore learn a per-node scoring function that estimates how "important" that node is. After each propagation step, for every node in our current graph, we predict an importance score

$$i_v^{(t)} = g_i(h_v, x_v)$$

where $g_i$ is a learned network, the importance network.

Once we have values of $i_v$, we take the top scoring nodes that have never been expanded and add them to our expanded set, and add all nodes adjacent to those nodes to our active set. Figure 2 illustrates this expansion. At the first time step only the detected nodes are expanded. At the second time step we expand chosen nodes based on importance values and add their neighbors to the graph. At the final time step we compute the per-node-output and re-order and zero-pad the outputs into the final classification net.
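
The selection and expansion logic can be sketched as follows. This is only an illustration of the bookkeeping, with the importance scores supplied by hand rather than predicted by a network; the function names, the toy graph, and the values `threshold=0.5` and `top_p=1` are assumptions made for the example.

```python
def initial_active_set(det_scores, neighbors, threshold):
    """Detected nodes (score above threshold) are expanded first;
    the active set is those nodes plus all of their neighbours."""
    expanded = {v for v, s in det_scores.items() if s > threshold}
    active = set(expanded)
    for v in expanded:
        active.update(neighbors.get(v, ()))
    return expanded, active

def expand_step(importance, expanded, active, neighbors, top_p):
    """Expand the top_p most important active nodes that have never
    been expanded, and add their neighbours to the active set."""
    candidates = [v for v in active if v not in expanded]
    # Highest importance first; ties broken by name so runs are repeatable.
    candidates.sort(key=lambda v: (-importance.get(v, 0.0), v))
    chosen = candidates[:top_p]
    new_expanded = expanded | set(chosen)
    new_active = set(active)
    for v in chosen:
        new_active.update(neighbors.get(v, ()))
    return new_expanded, new_active

# Toy graph (hypothetical concepts and scores).
neighbors = {
    "person": ["jeans", "horse"],
    "jeans": ["person", "blue"],
    "horse": ["person", "saddle"],
}
det_scores = {"person": 0.9, "horse": 0.2}
expanded, active = initial_active_set(det_scores, neighbors, threshold=0.5)

importance = {"jeans": 0.8, "horse": 0.3}   # stand-in for network output
expanded2, active2 = expand_step(importance, expanded, active,
                                 neighbors, top_p=1)
```

Only the nodes in the active set take part in the propagation update, which is what keeps the cost independent of the size of the full graph.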

To train the importance net, we assign a target importance value to each node in the graph for a given image. Nodes corresponding to ground-truth concepts in an image are assigned an importance value of 1. The neighbors of these nodes are assigned a value of $\gamma$. Nodes which are two hops away have value $\gamma^2$ and so on. The idea is that nodes closest to the final output are the most important to expand.
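
One way to compute such targets is a breadth-first search outward from the ground-truth nodes, discounting once per hop. The sketch below assumes an undirected neighbor list and a discount argument named `gamma` (a hypothetical name); it leaves out nodes that cannot be reached from any ground-truth node, which is an assumption rather than something the text specifies.

```python
from collections import deque

def importance_targets(neighbors, ground_truth, gamma):
    """Target importance gamma**d, where d is the hop distance from
    a node to its nearest ground-truth node (d = 0 gives 1)."""
    dist = {v: 0 for v in ground_truth}
    queue = deque(ground_truth)
    while queue:
        u = queue.popleft()
        for w in neighbors.get(u, ()):
            if w not in dist:          # first visit is the shortest path
                dist[w] = dist[u] + 1
                queue.append(w)
    return {v: gamma ** d for v, d in dist.items()}

# Chain a - b - c, plus an isolated node d (toy example).
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}
targets = importance_targets(chain, ["a"], gamma=0.5)
```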

Figure 2: Graph Search Neural Network expansion. Starts with detected nodes and expands neighbors. Adds nodes adjacent to expand nodes predicted by importance net.

We now have an end-to-end network which takes as input a set of initial nodes and annotations and outputs a per-node output for each of the active nodes in the graph. It consists of three sets of networks: the propagation net, the importance net, and the output net. The final loss from the image problem can be backpropagated from the final output of the pipeline back through the output net and the importance loss is backpropagated through each of the importance outputs. See Figure 3 for the GSNN architecture. First, the detection confidences initialize the hidden states of the initially detected nodes. We then initialize the hidden states of the adjacent nodes. We then update the hidden states using the propagation net. The updated hidden states are then used to predict the importance scores, which are used to pick the next nodes to add. These nodes are then initialized and the hidden states are updated again through the propagation net. After the final propagation step, we take all of the accumulated hidden states to predict the GSNN outputs for all the active nodes. During backpropagation, the binary cross entropy (BCE) loss is fed backward through the output layer, and the importance losses are fed through the importance networks to update the network parameters.

Figure 3: Graph Search Neural Network diagram. Shows initialization of hidden states, addition of new nodes as graph is expanded and the flow of losses through the output, propagation and importance nets.

One final detail is the addition of a "node bias" into GSNN. In GGNN, the per-node output function takes in the hidden state and initial annotation of the node to compute its output. In a certain sense it is agnostic to the meaning of the node. That is, at train or test time, GSNN takes in a graph it has perhaps never seen before, and some initial annotations for each node. It then uses the structure of the graph to propagate those annotations through the network and then compute an output. The nodes of the graph could have represented anything from human relationships to a computer program. However, in our graph network, the fact that a particular node represents "horse" or "cat" will probably be relevant, and we can also constrain ourselves to a static graph over image concepts. Hence we introduce node bias terms $n_v$ that, for every node in our graph, have some learned values. Our output equations are now $o_v = g(h_v^{(T)}, x_v, n_v)$ where $n_v$ is a bias term that is tied to a particular node $v$ in the overall graph. This value is stored in a table and is updated by backpropagation.

3.3 Image pipeline and baselines

Another problem we face adapting graph networks for vision problems is how to incorporate the graph network into an image pipeline. For classification, this is fairly straightforward. We take the output of the graph network, reorder it so that nodes always appear in the same order into the final network, and zero pad any nodes that were not expanded. Therefore, if we have a graph with node outputs, and each node predicts a -dim hidden variable, we create a -dim feature vector from the graph. We also concatenate this feature vector with the fc7 layer (4096-dim) of a fine-tuned VGG-16 network[35] and the top score for each COCO category predicted by Faster R-CNN (80-dim). This -dim feature vector is then fed into a 1-layer final classification network trained with dropout.
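
The reorder, zero-pad and concatenate step can be sketched as below. The function names are hypothetical, and the tiny sizes in the usage lines are toy values chosen for readability, not the dimensions used in the experiments.

```python
def graph_feature(node_outputs, node_order, out_dim):
    """Flatten per-node outputs in a fixed node order, writing zeros
    for any node that was never expanded (absent from node_outputs)."""
    feat = []
    for name in node_order:
        feat.extend(node_outputs.get(name, [0.0] * out_dim))
    return feat

def classifier_input(node_outputs, node_order, out_dim, fc7, det_scores):
    """Concatenate graph feature, image feature and detector scores."""
    return (graph_feature(node_outputs, node_order, out_dim)
            + list(fc7) + list(det_scores))

# Toy usage: three graph nodes, two outputs per node, only "dog" expanded.
order = ["cat", "dog", "horse"]
outputs = {"dog": [0.1, 0.2]}
g_feat = graph_feature(outputs, order, out_dim=2)
full = classifier_input(outputs, order, 2,
                        fc7=[0.5, 0.5, 0.5], det_scores=[0.9])
```

Fixing the node order matters because the final classification layer learns a separate weight for each position of the vector.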

For baselines, we compare to: (1) VGG Baseline - feed just fc7 into final classification net; (2) Detection Baseline - feed fc7 and top COCO scores into final classification net.

4 Results

4.1 Datasets

For our experiments, we wanted to test on a dataset that represents the complex, noisy visual world with its many different kinds of objects, where labels are potentially ambiguous and overlapping, and categories fall into a long-tail distribution[37]. Humans do well in this setting, but vision algorithms still struggle with it. To this end, we chose the Visual Genome dataset[15] v1.0.

Visual Genome contains over 100,000 natural images from the Internet. Each image is labeled with objects, attributes and relationships between objects entered by human annotators. Annotators could enter any object in the image rather than from a predefined list, so as a result there are thousands of object labels with some being more common and most having many fewer examples. There are on average 21 labeled objects in an image, so compared to datasets such as ImageNet[29] or PASCAL[7], the scenes we are considering are far more complex. Visual Genome is also labeled with object-object relationships and object-attribute relationships which we use for GSNN.

In our experiments, we create a subset from Visual Genome which we call Visual Genome multi-label dataset or VGML. In VGML, we take the 200 most common objects in the dataset and the 100 most common attributes and also add any COCO categories not in those 300 for a total of 316 visual concepts. Our task is then multi-label classification: for each image predict which subset of the 316 total categories appear in the scene. We randomly split the images into a roughly 80-20 train/test split. Since we used pre-trained detectors from COCO, we ensure none of our test images overlap with our detector's training images.

We also evaluate our method on the more standard COCO dataset[19] to show that our approach is useful on multiple datasets and that our method does not rely on graphs built specifically for our datasets. We train and test in the multi-label setting[24], and evaluate on the minival set[28].

4.2 Building the Knowledge Graph

We also use Visual Genome as a source for our knowledge graph. Using only the train split, we build a knowledge graph connecting the concepts using the most common object-attribute and object-object relationships in the dataset. Specifically, we counted how often an object/object relationship or object/attribute pair occurred in the training set, and pruned any edges that had fewer than 200 instances. This leaves us with a graph over all of the images with each edge being a common relationship. The idea is that we would get very common relationships (such as grass is green or person wears clothes) but not relationships that are rare and only occur in single images (such as person rides zebra).
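
The count-and-prune step amounts to a frequency filter over relationship instances. A minimal sketch, assuming each annotation has already been reduced to a `(subject, relation, object)` tuple (a hypothetical representation) and using a small `min_count` in the usage lines so the toy data stays readable:

```python
from collections import Counter

def build_graph(triples, min_count):
    """Keep only the edges that occur at least min_count times
    across the training annotations."""
    counts = Counter(triples)
    return {edge for edge, c in counts.items() if c >= min_count}

# Toy annotations: one common relationship, one rare one.
annotations = ([("grass", "is", "green")] * 3
               + [("person", "rides", "zebra")])
edges = build_graph(annotations, min_count=2)
```

With the threshold stated in the text, the call would use `min_count=200` on the train split only, so that no test-set statistics leak into the graph.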

The Visual Genome graphs are useful for our problem because they contain scene-level relationships between objects, e.g. person wears pants or fire hydrant is red, and thus allow the graph network to reason about what is in a scene. However, they do not contain useful semantic relationships. For instance, it might be helpful to know that dog is an animal if our visual system sees a dog and one of our labels is animal. To address this, we also create a version of the graph by fusing the Visual Genome graphs with WordNet[23]. Using the subset of WordNet from [10], we first collect new nodes in WordNet not in our output labels by including those which directly connect to our output labels, and are thus likely to be relevant, and add them to a combined graph. We then take all of the WordNet edges between these nodes and add them to our combined graph.

4.3 Training details

We jointly train all parts of the pipeline (except for the detectors). All models are trained with Stochastic Gradient Descent, except GSNN which is trained using ADAM[13]. We use an initial learning rate of , for the VGG net before , decreasing by a factor of every epochs, an L2 penalty of and a momentum of . We set our GSNN hidden state size to , importance discount factor to , number of time steps to , initial confidence threshold to and our expand number to . Our GSNN importance and output networks are single layer networks with sigmoid activations. All networks were trained for epochs with a batch size of 16.

4.4 Quantitative Evaluation

Table 1 shows the result of our method on Visual Genome multi-label classification. In this experiment, the combined Visual Genome, WordNet graph outperforms the Visual Genome graph. This suggests that including the outside semantic knowledge from WordNet and performing explicit reasoning on a knowledge graph allows our model to learn better representations compared to the other models.

We also perform experiments to test the effect that limiting the size of the training dataset has on performance. Figure 4 shows the results of this experiment on Visual Genome, varying the training set size from the entire training set (approximately 80,000), all the way down to 500 examples. Choosing the subsets of examples for these experiments is done randomly, but each training set is a subset of the larger ones, e.g. all of the examples in the 1,000 set are also in the 2,000 set. We see that, until the 1,000 sample set, the GSNN-based methods all outperform baselines. At 1,000 and 500 examples, all of the methods perform equally. Given the long-tail nature of Visual Genome, it is likely that for fewer than 2,000 samples, many categories do not have enough examples for any method to learn well. This experiment indicates that our method is able to improve even in the low-data case up to a point.
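
One simple way to obtain random training sets that are nested in this manner is to shuffle once and take prefixes. The text does not say how the subsets were actually drawn, so the sketch below is only one possible construction; `nested_subsets` and the `seed` argument are hypothetical.

```python
import random

def nested_subsets(examples, sizes, seed=0):
    """Shuffle once, then take prefixes, so that every smaller
    subset is contained in every larger one."""
    order = list(examples)
    random.Random(seed).shuffle(order)
    return {n: order[:n] for n in sizes}

# Toy usage with 100 example ids.
subsets = nested_subsets(range(100), [5, 10, 20])
```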

In Table 2, we show results on the COCO multi-label dataset. We can see that the boost from using graph knowledge is more significant than it was on Visual Genome. One possible explanation is that the Visual Genome knowledge graph provides significant information which helps improve the performance on the COCO dataset itself. In the previous Visual Genome experiment, much of the graph information is contained in the labels and images themselves. One other interesting result is that the Visual Genome graph outperforms the combined graph for COCO, though both outperform baselines. One possible reason is that the original VGML graph is smaller, cleaner, and contains more relevant information than the combined graph. Furthermore, in the VGML experiment, WordNet is new outside information for the algorithm helping boost the performance.

One possible concern is the over-dependence of the graph reasoning on the set of 80 COCO detectors and initial detections. Therefore, we performed an ablation experiment to see how sensitive our method is to having all of the initial detections. We reran the COCO experiments with both graphs using two different subsets of COCO detectors. The first subset is just the even COCO categories and the second subset is just the odd categories. We see from Table 3 that GSNN methods again outperform the baselines.

Table 1: Mean Average Precision for multi-label classification on Visual Genome Multi-Label dataset. Numbers for VGG baseline, VGG baseline with detections, GSNN using Visual Genome graph and GSNN using a combined Visual Genome and WordNet graph.

Figure 4: Mean Average Precision on Visual Genome in the low data setting. Shows performance for all methods for the full dataset, 40,000, 20,000, 10,000, 5,000, 2,000, 1,000, and 500 training examples.

Table 2: Mean Average Precision for multi-label classification on COCO. Numbers for VGG baseline, VGG baseline with detections, GSNN using Visual Genome graph and GSNN using combined Visual Genome and WordNet graph.

Table 3: Mean Average Precision for multi-label classification on COCO, using only odd and even detectors.

Figure 5: Difference in Average Precision for each of the 316 labels in VGML between our GSNN combined graph model and detection baseline for the Visual Genome experiment. Top categories: scissors, donut, frisbee, microwave, fork. Bottom categories: stacked, tiled, light brown, ocean, grassy.

Figure 6: Difference in Average Precision for each of the 80 labels in COCO between our GSNN VG graph model and detection baseline for the COCO experiment. Top categories: fork, donut, cup, apple, microwave. Bottom categories: hairdryer, parking meter, bear, kite, and giraffe.

As one might suspect, our method does not perform uniformly on all categories, but rather does better on some categories and worse on others. Figure 5 shows the differences in average precision for each category between our GSNN model with the combined graph and the detection baseline for the VGML experiment. Figure 6 shows the same for our COCO experiment. Performance on some classes improves greatly, such as "fork" in our VGML experiment and "scissors" in our COCO experiment. These and other good results on "knife" and "toothbrush" seem to indicate that the graph reasoning helps especially with small objects in the image. In the next section, we analyze our GSNN models on several examples to try to gain a better intuition as to what the GSNN model is doing and why it does well or poorly on certain examples.

4.5 Qualitative Evaluation

Figure 7: Sensitivity analysis of GSNN in VGML experiment (left) and COCO experiment (right) with the combined graph and Visual Genome graphs respectively. Each example shows the image, part of the knowledge graph expanded during the classification, and the sensitivity values of the initial detections, and the hidden states at time steps 2 and 3 with respect to the output class listed. The top detections and hidden state nodes are printed for convenience since the x-axis is too large to list every class. The top and middle rows show the results for images and classes where the GSNN significantly outperforms the detection baseline to get an intuition for when our method is working. The bottom row shows images and classes where GSNN does worse than the detection baseline to get an idea of when our method fails and why.

One way to analyze the GSNN is to look at the sensitivities of parameters in our model with respect to a particular output. Given a single image and a single label of interest that appears in the image, we would like to know how information travels through the GSNN and what nodes and edges it uses. We examined the sensitivity of the output to hidden states and detections by computing the partial derivatives with respect to the category of interest. These values tell us how a small change in the hidden state of a particular node affects a particular output. We would expect to see, for instance, that for labeling elephant, we see a high sensitivity for the hidden states corresponding to grey and trunk.
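
To make the notion of sensitivity concrete, the sketch below estimates the partial derivatives of a scalar output with respect to each entry of a hidden-state vector using central finite differences. This is a swapped-in numerical technique chosen to keep the example self-contained; in a trained network the same quantities would normally come from backpropagation. The function `toy_output` is a hypothetical stand-in for one class score, not part of the model.

```python
import math

def sensitivities(f, h, eps=1e-5):
    """Central-difference estimate of d f / d h[k] for every k."""
    grads = []
    for k in range(len(h)):
        up = list(h)
        up[k] += eps
        down = list(h)
        down[k] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

def toy_output(h):
    """Hypothetical class score: a sigmoid over a fixed linear
    combination that ignores the third hidden unit."""
    return 1.0 / (1.0 + math.exp(-(2.0 * h[0] - 1.0 * h[1])))

grads = sensitivities(toy_output, [0.0, 0.0, 0.0])
```

A large magnitude marks a hidden unit whose small changes move the output strongly, and a value of zero marks a unit the output does not depend on.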

In this section, we show the sensitivity analysis for the GSNN combined graph model on the VGML experiment and the Visual Genome graph on the COCO experiments. In particular, we examine some classes that performed well under GSNN compared to the detection baseline and a few that performed poorly to try to get a better intuition into why some categories improve more.

Figure 7 shows the graph sensitivity analysis for the experiments with VGML on the left and COCO on the right, showing four examples where GSNN does better and two where it does worse. Each example shows the image, the ground truth output we are analyzing, and the sensitivities of the concept of interest with respect to the hidden states of the graph or detections. For convenience, we display the names of the top detections or hidden states. We also show part of the graph that was expanded, to see what relationships GSNN was using.

For the VGML experiment, the top left of Figure 7 shows that using the detection for person, GSNN is able to reason that jeans are more likely, since jeans are usually on people in images, using the "wearing" edge. It is also sensitive to skateboard and horse, and each of these has a second-order connection to jeans through person, so it is likely able to capture the fact that people tend to wear jeans while on horses and skateboards. Note that the sensitivities are not the same as the actual detections, so it is not contradictory that horse has high sensitivity. The second row on the left shows a successful case for bicycle, using detections from person and skateboard and the fact that people tend to be "on" bicycles and skateboards. The last row shows a failure case for windshield. It correctly correlates with bus, but because the knowledge graph lacks a connection between bus and windshield, the graph network is unable to do better than the detection baseline.

On the right, for the COCO experiment, the top example shows that fork is highly correlated with the detection for fork, which should not be surprising. However, it is able to reinforce this detection with the connections between broccoli and dining table, which are both two-step connections to fork on the graph. Similarly, the middle example shows that the graph connections for pizza, bowl, and bottle being "on" dining table reinforce the detection of dining table. The bottom right shows another failure case. It is able to get the connection between the detection for toilet and hair dryer (both found in the bathroom), but the lack of good connections in the graph prevents the GSNN from improving over the baseline.
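The notion of first-order and second-order connections used in this discussion can be made concrete with a short breadth-first search over a toy knowledge graph. The sketch below is only illustrative: the edge list, the edge labels other than "wearing" and "on", and the helper names `build_adjacency` and `hop_distance` are all assumptions, not the graphs or code used in the experiments.

```python
from collections import deque

# Hypothetical toy knowledge graph as (source, edge label, target) triples.
EDGES = [
    ("person", "wearing", "jeans"),
    ("person", "on", "skateboard"),
    ("person", "on", "horse"),
    ("person", "on", "bicycle"),
    ("bus", "has", "wheel"),        # assumed edge; bus has no link to windshield
    ("car", "has", "windshield"),   # assumed edge
]


def build_adjacency(edges):
    """Treat every labeled edge as traversable in both directions."""
    adj = {}
    for src, _label, dst in edges:
        adj.setdefault(src, set()).add(dst)
        adj.setdefault(dst, set()).add(src)
    return adj


def hop_distance(adj, start, goal):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    if start not in adj or goal not in adj:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in adj[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None


if __name__ == "__main__":
    adj = build_adjacency(EDGES)
    for a, b in [("person", "jeans"), ("horse", "jeans"), ("bus", "windshield")]:
        print(a, "->", b, ":", hop_distance(adj, a, b))
```

In this toy graph, horse reaches jeans in two hops through person, which mirrors the second-order connection described above, while bus and windshield are not connected at all, which mirrors the failure case where no path exists for information to flow along.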

5 Conclusion

In this paper, we present the Graph Search Neural Network (GSNN) as a way of efficiently using knowledge graphs as extra information to improve image classification. We provide analysis that examines the flow of information through the GSNN and provides insights into why our model improves performance. We hope that this work provides a step towards bringing symbolic reasoning into traditional feed-forward computer vision frameworks.

The GSNN and the framework we use for vision problems are completely general. Our next steps will be to apply the GSNN to other vision tasks, such as detection, Visual Question Answering, and image captioning. Another interesting direction would be to combine the procedure of this work with a system such as NEIL[4] to create a system which builds knowledge graphs and then prunes them to get a more accurate, useful graph for image tasks.

Acknowledgements: We would like to thank everyone who took time to review this work and provide helpful comments. This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA). The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. government. The U.S. Government is authorized to reproduce and distribute the reprints for governmental purposes notwithstanding any copyright annotation therein. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1252522 and ONR MURI N000141612007.

References

[1] K. M. Borgwardt and H.-P. Kriegel. Shortest-path kernels on graphs. ICDM, 2005.
[2] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
[3] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka, and T. M. Mitchell. Toward an architecture for never-ending language learning. AAAI, 2010.
[4] X. Chen, A. Shrivastava, and A. Gupta. Neil: Extracting visual knowledge from web data. CVPR, 2013.
[5] K. Duan, D. Parikh, D. Crandall, and K. Grauman. Discovering localized attributes for fine-grained recognition. CVPR, 2012.
[6] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. NIPS, 2015.
[7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[8] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. CVPR, 2009.
[9] M. Gori, G. Monfardini, and F. Scarselli. A new model for learning in graph domains. IEEE International Joint Conference on Neural Networks, 2, 2005.
[10] K. Guu, J. Miller, and P. Liang. Traversing knowledge graphs in vector space. In Empirical Methods in Natural Language Processing (EMNLP), 2015.
[11] M. Henaff, J. Bruna, and Y. LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.
[12] J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. A. Shamma, M. S. Bernstein, and L. Fei-Fei. Image retrieval using scene graphs. CVPR, 2015.
[13] D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. ICLR, 2015.
[14] R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input spaces. ICML, 2, 2002.
[15] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. 2016.
[16] C. H. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. TPAMI, 2014.
[17] N. Lao, T. Mitchell, and W. W. Cohen. Random walk inference and learning in a large scale knowledge base. NIPS, 2011.
[18] Y. Li and R. Zemel. Gated graph sequence neural networks. ICLR, 2016.
[19] T. Lin, M. Maire, S. J. Belongie, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: common objects in context. ECCV, 2014.
[20] T. Malisiewicz and A. Efros. Beyond categories: The visual memex model for reasoning about object relationships. NIPS, 2009.
[21] V. D. Massa, G. Monfardini, L. Sarti, F. Scarselli, M. Maggini, and M. Gori. A comparison between recursive neural networks and graph neural networks. IEEE International Joint Conference on Neural Network Proceedings, 2006.
[22] A. Micheli. Neural network for graphs: A contextual constructive approach. IEEE Transactions on Neural Networks, 2009.
[23] G. A. Miller. Wordnet: A lexical database for english. ACM, 38, 1995.
[24] I. Misra, C. L. Zitnick, M. Mitchell, and R. Girshick. Seeing through the Human Reporting Bias: Visual Classifiers from Noisy Human-Centric Labels. In CVPR, 2016.
[25] M. Niepert, M. Ahmed, and K. Kutzkov. Learning convolutional neural networks for graphs. arXiv preprint arXiv:1605.05273, 2016.
[26] F. Orsini, P. Frasconi, and L. D. Raedt. Graph invariant kernels. IJCAI, 2015.
[27] P. Yanardag and S. V. N. Vishwanathan. Deep graph kernels. KDDM, 2015.
[28] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS, 2015.
[29] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252, 2015.
[30] F. Sadeghi, S. K. Divvala, and A. Farhadi. Viske: Visual knowledge extraction and question answering by visual verification of relation phrases. CVPR, 2015.
[31] F. Scarselli, M. Gori, A. C. Tsoi, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 2009.
[32] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and K. M. Borgwardt. Weisfeiler-lehman graph kernels. JMLR, 2011.
[33] N. Shervashidze, S. V. N. Vishwanathan, T. H. Petri, K. Mehlhorn, and K. M. Borgwardt. Efficient graphlet kernels for large graph comparison. AISTATS, 5, 2009.
[34] A. Shrivastava, S. Singh, and A. Gupta. Constrained semi-supervised learning using attributes and comparative attributes. ECCV, 2012.
[35] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[36] S. V. N. Vishwanathan, N. N. Schraudolph, R. Kondor, and K. M. Borgwardt. Graph kernels. JMLR, 2010.
[37] X. Zhu, D. Anguelov, and D. Ramanan. Capturing long-tail distributions of object subcategories. CVPR, 2014.
[38] Y. Zhu, A. Fathi, and L. Fei-Fei. Reasoning about Object Affordances in a Knowledge Base Representation. In European Conference on Computer Vision, 2014.
[39] Y. Zhu, C. Zhang, C. Ré, and L. Fei-Fei. Building a large-scale multimodal knowledge base system for answering visual queries. arXiv preprint arXiv:1507.05670, 2015.

Source: https://www.arxiv-vanity.com/papers/1612.04844/