Skip to main content

DeepType: Multilingual Entity Linking by Neural Type System Evolution

Jonathan Raiman


San Francisco, California

Olivier Raiman


Paris, France


The wealth of structured (e.g. Wikidata) and unstructured data about the world available today presents an incredible opportunity for tomorrow’s Artificial Intelligence. So far, integration of these two different modalities is a difficult process, involving many decisions concerning how best to represent the information so that it will be captured or useful, and hand-labeling large amounts of data. DeepType overcomes this challenge by explicitly integrating symbolic information into the reasoning process of a neural network with a type system. First we construct a type system, and second, we use it to constrain the outputs of a neural network to respect the symbolic structure. We achieve this by reformulating the de- sign problem into a mixed integer problem: create a type system and subsequently train a neural network with it. In this reformulation discrete variables select which parent-child relations from an ontology are types within the type system, while continuous variables control a classifier fit to the type system. The original problem cannot be solved exactly, so we propose a 2-step algorithm: 1) heuristic search or stochastic optimization over discrete variables that define a type system informed by an Oracle and a Learnability heuristic, 2) gradient descent to fit classifier parameters. We apply DeepType to the problem of Entity Linking on three standard datasets (i.e. WikiDisamb30, CoNLL (YAGO), TAC KBP 2010) and find that it outperforms all existing solutions by a wide margin, including approaches that rely on a human-designed type system or recent deep learning-based entity embeddings, while explicitly using symbolic information lets it integrate new entities without retraining.

1        Introduction

Online encyclopedias, knowledge bases, ontologies (e.g. Wikipedia, Wikidata, Wordnet), alongside image and video datasets with their associated label and category hierarchies (e.g. Imagenet (Deng et al. 2009), Youtube-8M (Abu-El- Haija et al. 2016), Kinetics (Kay et al. 2017)) offer an un- precedented opportunity for incorporating symbolic representations within distributed and neural representations in Artificial Intelligence systems. Several approaches exist for integrating rich symbolic structures within the behavior of neural networks: a label hierarchy aware loss function that relies on the ultrametric tree distance between labels (e.g. 2017), a loss function that trades off specificity for accuracy by incorporating hypo/hypernymy relations (Deng et al. 2012), using NER types to constrain the behavior of an Entity Linking system (Ling, Singh, and Weld 2015), or more recently integrating explicit type constraints within a decoder’s grammar for neural semantic parsing (Krishnamurthy, Dasigi, and Gardner 2017). However, current approaches face several difficulties:

  • Selection of the right symbolic information based on the utility or information gain for a target task.
  • Design of the representation for symbolic information (hierarchy, grammar, constraints).
  • Hand-labelling large amounts of data.

DeepType overcomes these difficulties by explicitly integrating symbolic information into the reasoning process of a neural network with a type system that is automatically designed without human effort for a target task. We achieve this by reformulating the design problem into a mixed integer problem: create a type system by selecting roots and edges from an ontology that serve as types in a type system, and subsequently train a neural network with it. The original problem cannot be solved exactly, so we propose a 2-step algorithm:

  1. heuristic search or stochastic optimization over the discrete variable assignments controlling type system design, using an Oracle and a Learnability heuristic to ensure that design decisions will be easy to learn by a neural network, and will provide improvements on the target task,
  2. gradient descent to fit classifier parameters to predict the behavior of the type system.

In order to validate the benefits of our approach, we focus on applying DeepType to Entity Linking (EL), the task of resolving ambiguous mentions of entities to their referent entities in a knowledge base (KB) (e.g. Wikipedia). Specifically we compare our results to state of the art systems on three standard datasets (WikiDisamb30, CoNLL (YAGO), TAC KBP 2010). We verify whether our approach can work in multiple languages, and whether optimization of the type system for a particular language generalizes to other languages1 by training our full system in a monolingual (English) and bilingual setup (English and French), and also evaluate our Oracle (performance upper bound) on German and Spanish test datasets. We compare stochastic optimization and heuristic search to solve our mixed integer problem by comparing the final performance of systems whose type systems came from different search methodologies. We also investigate whether symbolic information is captured by using DeepType as pretraining for Named Entity Recognition (NER) on two standard datasets (i.e. CoNLL 2003 (Sang and Meulder 2003), OntoNotes 5.0 (CoNLL 2012) (Pradhan et al. 2012)).

Our key contributions in this work are as follows:

  • A system for integrating symbolic knowledge into the reasoning process of a neural network through a type system, to constrain the behavior to respect the desired symbolic structure, and automatically design the type system with- out human effort.
  • An approach to EL that uses type constraints, reduces disambiguation resolution complexity from O(N 2) to O(N ), incorporates new entities into the system with- out retraining, and outperforms all existing solutions by a wide margin.

We release code for designing, evolving, and training neural type systems2. Moreover, we observe that disambiguation accuracy reaches 99.0% on CoNLL (YAGO) and 98.6% on TAC KBP 2010 when entity types are predicted by an Oracle, suggesting that EL would be almost solved if we can improve type prediction accuracy.

The rest of this paper is structured as follows. In Section 2 we introduce EL and EL with Types, in Section 3 we describe DeepType for EL, In Section 4 we provide experimental results for DeepType applied to EL and evidence of cross-lingual and cross-domain transfer of the representation learned by a DeepType system. In Section 5 we relate our work to existing approaches. Conclusions and directions for future work are given in Section 6.

2        Task

Before we define how DeepType can be used to constrain the outputs of a neural network using a type system, we will first define the goal task of Entity Linking.

Entity Linking The goal is to recover the ground truth entities in a KB referred to in a document by locating mentions (text spans), and for each mention properly disambiguating the referent entity. Commonly, a lookup table that maps each mention to a proposal set of entities for each mention m: Em = {e1, . . . , en} (e.g. “Washington” could mean Washington, D.C. or George Washington). Disambiguation is finding for each mention m the a ground truth entity eGT in Em. Typically, disambiguation operates according to two criteria: in a large corpus, how often does a mention point to an entity, LinkCount(m, e), and how often does entity e1 cooccur with entity e2, an O(N2) process, often named coherence (Milne and Witten 2008; Ferragina and Scaiella 2010; Yamada et al. 2016).

Entity Linking with Types

In this work we extend the EL task to associate with each entity a series of types (e.g. Person, Place, etc.) that if known, would rule out invalid answers, and therefore ease linking (e.g. the context now enables types to disambiguate “Washington”). Knowledge of the types T associated with a mention can also help prune entities from the the proposal set, to produce a constrained set: Em,T ⊆ Em. In a probabilistic setting it is also possible to rank an entity e in document x according to its likelihood under the type system prediction and under the entity model:

P(e|x) ∝ Ptype(types(e)|x) · Pentity(e|x, types(e))(1)

In prior work, the 112 FIGER Types (Ling and Weld 2012) were associated with entities to combine an NER tagger with an EL system (Ling, Singh, and Weld 2015). In their work, they found that regular NER types were unhelpful, while finer grain FIGER types improved system performance.

3        DeepType for Entity Linking

DeepType is a strategy for integrating symbolic knowledge into the reasoning process of a neural network through a type system. When we apply this technique to EL, we constrain the behavior of an entity prediction model to respect the symbolic structure defined by types. As an example, when we attempt to disambiguate “Jaguar” the benefits of this approach are apparent: our decision can be based on whether the predicted type is Animal or Road Vehicle as shown visually in Figure 1.

In this section, we will first define key terminology, then explain the model and its sub-components separately.


Relation Given some knowledge graph or feature set, a relation is a set of inheritance rules that define membership or exclusion from a particular group. For instance the relation instance of(city) selects all children of the root city connected by instance of as members of the group, depicted by outlined boxes in Figure 2.

Type In this work a type is a label defined by a relation (e.g. IsHuman is the type applied to all children of Human connected by instance of).

Type Axis  A set of mutually exclusive types (e.g. IsHuman ∧ IsPlant = {}).

Type System A grouping of type axes, A, along with a type labelling function: {t , . . . , t } = TypeLabeler(e, A). For instance a type system with two axes {IsA, Topic} assigns to George Washington: {Person, Politics}, and to Washington, D.C.: {Place, Geography}).


To construct an EL system that uses type constraints we re- quire: a type system, the associated type classifier, and a

Figure 1: Example model output: “jaguar” refers to different entities depending on context. Predicting the type associated with each word (e.g. animal, region, etc.) helps eliminate options that do not match, and recover the true entity. Bar charts give the system’s belief over the type-axis “IsA”, and the table shows how types affects entity probabilities given by Wikipedia links.
Figure 2: Defining group membership with a knowledge graph relation: children of root (city) via edge (instance of).

model for predicting and ranking entities given a mention. Instead of assuming we receive a type system, classifier, entity prediction model, we will instead create the type system and its classifier starting from a given entity prediction model and ontology with text snippets containing entity mentions (e.g. Wikidata and Wikipedia). For simplicity we use LinkCount(e, m) as our entity prediction model.

We restrict the types in our type systems to use a set of parent-child relations over the ontology in Wikipedia and Wikidata, where each type axis has a root node r and an edge type g, that sets membership or exclusion from the axis (e.g. r = human, e = instance of, splits entities into: human vs. non-human3 ).

We then reformulate the problem into a mixed integer problem, where discrete variables control which roots r1, . . . , rk and edge types g1, . . . , gk among all roots R and edge types G will define type axes, while the continuous variables θ parametrize a classifier fit to the type system. Our goal in type system design is to select parent-child relations that a classifier easily predicts, and where the types improve disambiguation accuracy.


To formally define our mixed integer problem, let us first denote A as the assignment for the discrete variables that define our type system (i.e. boolean variables defining if a parent-child relation gets included in our type system), θ as the parameters for our entity prediction model and type classifier, and Smode(A,θ) as the disambiguation accuracy given a test corpus containing mentions M = {(m,e0GT,Em0)}. We now assume our model produces some score for each proposed entity e given a mention m in a document D, defined EntityScore(e, m, D, A, θ). The predicted entity for a given mention is thus: e = argmaxe∈Em EntityScore(e, m, D, A, θ). If e= e GT, the mention is disambiguated. Our problem is thus defined as:

This original formulation cannot be solved exactly4. To make this problem tractable we propose a 2-step algorithm:

  1. Discrete Optimization of Type System: Heuristic search or stochastic optimization over the discrete variables of the type system, A, informed by a Learnability heuristic and an Oracle.
  2. Type classifier: Gradient descent over continuous variables θ to fit type classifier and entity prediction model.

We will now explain in more detail discrete optimization of a type system, our heuristics (Oracle and Learnability heuristic), the type classifier, and inference in this model.

Discrete Optimization of a Type System

The original objective Smodel(A, θ) cannot be solved exactly, thus we rely on heuristic search or stochastic optimization to find suitable assignments for A. To avoid training an entire type classifier and entity prediction model for each evaluation of the objective function, we instead use a proxy objective J for the discrete optimization5. To ensure that maximizing J (A) also maximizes Smodel(A, θ), we introduce a Learnability heuristic and an Oracle that quantify the disambiguation power of a proposed type system, an estimate of how learnable the type axes in the selected solution will be. We measure an upper bound for the disambiguation power by measuring disambiguation accuracy Soracle for a type classifier Oracle over a test corpus.

To ensure that the additional disambiguation power of a solution A translates in practice we weigh by an estimate of solution’s learnability Learnability(A) improvements between Soracle and the accuracy of a system that predicts only according to the entity prediction model6, Sgreedy. according to the entity prediction model6 , Sgreedy. Selecting a large number of type axes will provide strong disambiguation power, but may lead to degenerate solutions that are harder to train, slow down inference, and lack higher-level concepts that provide similar accuracy with less axes. We prevent this by adding a per type axis penalty of λ. Combining these three terms gives us the equation for J:

Oracle Our Oracle is a methodology for abstracting away machine learning performance from the underlying representational power of a type system A. It operates on a test corpus with a set of mentions, entities, and proposal sets: mi, eGT, Em . The Oracle prunes each proposal set to only contain entities whose types match those of eGTi, yielding Em,oracle. Types fully disambiguate when |Em,oracle| = 1, otherwise we use the entity prediction model to select the right entity in the remainder set Emi,oracle:

If Oracle(m) = e GT, the mention is disambiguated. Oracle accuracy is denoted Soracle given a type system over a test corpus containing mentions M ={(m0, eGT0, Em0), . . . ,(mn, eGTn , Emn)}:

Learnability To ensure that disambiguation gains obtained during the discrete optimization are available when we train our type classifier, we want to ensure that the types selected are easy to predict. The Learnability heuristic empirically measures the average performance of classifiers at predicting the presence of a type within some Learnability specific training set.
To efficiently estimate Learnability for a full type system we make an independence assumption and model it as the mean of the Learnability for each individual axis, ignoring positive or negative transfer effects between different type axes. This assumption lets us parallelize training of

Figure 3: Text window classifier in (a) serves as type Learnability estimator, while the network in (b) takes longer to train, but discovers long-term dependencies to predict types and jointly produces a distribution for multiple type axes.

simpler classifiers for each type axis. We measure the area under its receiver operating characteristics curve (AUC) for each classifier and compute the type system’s learnability: Learnability

We use a text window classifier trained over windows of 10 words before and after a mention. Words are represented with randomly initialized word embeddings; the classifier is illustrated in Figure 3a. AUC is averaged over 4 training runs for each type axis.

Type Classifier

After the discrete optimization has completed we now have a type system A. We can now use this type system to label data in multiple languages from text snippets associated with the ontology7, and supervize a Type classifier.
The goal for this classifier is to discover long-term dependencies in the input data that let it reliably predict types across many contexts and languages. For this reason we select a bidirectional-LSTM (Lample et al. 2016) with word, prefix, and suffix embeddings as done in (Andor et al. 2016). Our network is shown pictorially in Figure 3b. Our classifier is trained to minimize the negative log likelihood of the per-token types for each type axis in the document D with L tokens:

When using Wikipedia as our source of text snippets our label supervision is partial, so we make a conditional independence assumption about our predictions and use Softmax as our output activation:


At inference-time we incorporate classifier belief into our decision process by first running it over the full context and obtaining a belief over each type axis for each in- put word w0, . . . , wL. For each mention m covering words wx, . . . , wy, we obtain the type conditional probability for all type axes i: {Pi(·|wx, D), . . . , Pi(·|wy, D)}. In multi- word mentions we must combine beliefs over multiple to- kens x . . . y: the product of the beliefs over the mention’s tokens is correct but numerically unstable and slightly less performant than max-over-time9, which we denote for the i-th type axis: Pi,(·|m, D).

The score se,m,D,A = EntityScore(e, m, D, A, θ) of an entity e given these conditional probability distributions P1,(·|m, D), . . . , Pk,(·|m, D), and the entities’ types in each axis t1, . . . , tk can then be combined to rank entities according to how predicted they were by both the entity pre- diction model and the type system. The chosen entity e for a mention m is chosen by taking the option that maxi- mizes the score among the Em possible entities; the equation for scoring and e is given below, with PLink(e|m) = LinkCount(m,e)/Ej∈Em LinkCount(m,j), αi a per type axis smoothing parameter, β is a smoothing parameter over all types:

4        Results

Type System Discovery

In the following experiments we evaluate the behavior of different search methodologies for type system discovery: which method best scales to large numbers of types, achieves high accuracy on the target EL task, and whether the choice of search impacts learnability by a classifier or generalisability to held-out EL datasets.

For the following experiments we optimize DeepType’s type system over a held-out set of 1000 randomly sampled articles taken from the Feb. 2017 English Wikipedia dump, with the Learnability heuristic text window classifiers trained only on those articles. The type classifier is trained jointly on English and French articles, totalling 800 million tokens for training, 1 million tokens for validation, sampled equally from either language.

We restrict roots R and edges G to the most common 1.5·105 entities that are entity parents through wikipedia category or instance of edges, and eliminate type axes where Learnability(·) is 0, leaving 53,626 type axes.

Human Type System Baseline To isolate discrete optimization from system performance and gain perspective on the difficulty and nature of the type system design we in- corporate a human-designed type system. Human designers have access to the full set of entities and relations in Wikipedia and Wikidata, and compose different inheritance rules through Boolean algebra to obtain higher level concepts (e.g. woman = IsHuman ∧ IsFemale, or animal = IsTaxon∧¬{IsHuman∨IsPlant}10). The final human system uses 5 type axes11, and 1218 inheritance rules.

Search methodologies

Beam Search and Greedy selection We iteratively con- struct a type system by choosing among all remaining type axes and evaluating whether the inclusion of a new type axis improves our objective: J (A ∪ {tj}) > J (A). We use a beam size of b and stop the search when all solutions stop growing.

Cross-Entropy Method  (CEM) (Rubinstein 1999) is a stochastic optimization procedure applicable to the selection of types. We begin with a probability vector P0 set to pstart, and at each iteration we sample MCEM vectors s from the Bernoulli distribution given by Pi, and measure each sample’s fitness with Eq. 3. The NCEM highest fitness elements are our winning population St at iteration t. Our probabilities are fit to St giving

Genetic Algorithm The best subset of type axes can be found by representing type axes as genes carried by Npopulation individuals in a population undergoing mutations and crossovers (Harvey 2009) over G generations. We select individuals using Eq. 3 as our fitness function.

Search Methodology Performance Impact To validate that λ controls type system size, and find the best tradeoff between size and accuracy, we experiment with a range of values and find that accuracy grows more slowly below 0.00007, while system size still increases.

From this point on we keep λ = 0.00007, and we compare the number of iterations needed by different search methods to converge, against two baselines: the empty set and the mean performance of 100 randomly sampled sets of 128 types (Table 1a). We observe that the performance of stochastic optimizers GA and CEM is similar to heuristic search, but requires orders of magnitude less function evaluations.

Next, we compare the behavior of the different search methods to a human designed system and state of the art approaches on three standard datasets (i.e. WIKI-DISAMB30 (WKD30) (Ferragina and Scaiella 2010) 12, CoNLL(YAGO) (Hoffart et al. 2011), and TAC KBP 2010 (Ji et al. 2010)), along with test sets built by randomly sampling 1000 articles from Wikipedia’s February 2017 dump in English, French, German, and Spanish which were excluded from training the classifiers. Table 1c has Oracle performance for the different search methods on the test sets, where we report disambiguation accuracy per annotation. A LinkCount baseline is included that selects the mention’s most frequently linked entity13. All search techniques’ Oracle ac

Table 1: Method comparisons. Highest value in bold, excluding oracles.

curacy significantly improve over LinkCount, and achieve near perfect accuracy on all datasets (97-99%); furthermore we notice that performance between the held-out Wikipedia sets and standard datasets sets is similar, supporting the claim that the discovered type systems generalize well. We note that machine discovered type systems outperform hu- man designed systems: CEM beats the human type system on English Wikipedia, and all search method’s type systems outperform human systems on WIKI-DISAMB30, CoNLL(YAGO), and TAC KBP 2010.

Search Methodology Learnability Impact To under- stand whether the type systems produced by different search methods can be trained similarly well we compare the type system built by GA, CEM, greedy, and the one constructed manually. EL Disambiguation accuracy is shown in Table 1c, where we compare with recent deep-learning based approaches (Globerson et al. 2016), or recent work by Ya- mada et al. for embedding word and entities (Yamada et al. 2016), or documents and entities (Yamada et al. 2017), along with count and coherence based techniques Tagme (Ferragina and Scaiella 2010) and Milne & Witten (Milne and Witten 2008). To obtain Tagme’s Feb. 2017 Wikipedia accuracy we query the public web API14 available in German and English, while other methods can be compared on CoNLL(YAGO) and TAC KBP 2010. Models trained on a human type system outperform all previous approaches to entity linking, while type systems discovered by machines lead to even higher performance on all datasets except English Wikipedia.

Cross-Lingual Transfer

Type systems are defined over Wikidata/Wikipedia, a multi- lingual knowledge base/encyclopaedia, thus type axes are language independent and can produce cross-lingual super- vision. To verify whether this cross-lingual ability is useful we train a type system on an English dataset and verify whether it can successfully supervize French data. We also measure using the Oracle (performance upper bound) whether the type system is useful in Spanish or German. Oracle performance across multiple languages does not appear to degrade when transferring to other languages (Table 1c). We also notice that training in French with an English type system still yields improvements over LinkCount for CEM, greedy, and human systems.

Because multi-lingual training might oversubscribe the model, we verified if monolingual would outperform bilingual training: we compare GA in English + French with only English (last row of Table 1c). Bilingual training does not appear to hurt, and might in fact be helpful.

We follow-up by inspecting whether the bilingual word vector space led to shared representations: common nouns have their English-French translation close-by, while proper nouns do not (French and US politicians cluster separately).

Named Entity Recognition Transfer

The goal of our NER experiment is to verify whether Deep- Type produces a type sensitive language representation useful for transfer to other downstream tasks. To measure this we pre-train a type classifier with a character-CNN and word embeddings as inputs, following (Kim et al. 2015), and replace the output layer with a linear-chain CRF (Lample et al. 2016) to fine-tune to NER data. Our model’s F1 scores when transferring to the CoNLL 2003 NER task and OntoNotes 5.0 (CoNLL 2012) split are given in Table 1b. We com- pare with two baselines that share the architecture but are not pre-trained, along with the current state of the art (Chiu and Nichols 2015).

We see positive transfer on Ontonotes and CoNLL: our baseline Bi-LSTM strongly outperforms (Chiu and Nichols 2015)’s baseline, while pre-training gives an additional 3-4 F1 points, with our best model outperforming the state of the art on the OntoNotes development split. While our base- line LSTM-CRF performs better than in the literature, our strongest baseline (CNN+LSTM+CRF) does not match the state of the art with a lexicon. We find that DeepType al- ways improves over baselines and partially recovers lexicon performance gains, but does not fully replace lexicons.

5        Related Work

Neural Network Reasoning with Symbolic structures Several approaches exist for incorporating symbolic structures into the reasoning process of a neural network by de- signing a loss function that is defined with a label hierarchy. In particular the work of (Deng et al. 2012) trades off specificity for accuracy, by leveraging the hyper/hyponymy relation to make a model aware of different granularity levels. Our work differs from this approach in that we design our type system within an ontology to meet specific accu- racy goals, while they make the accuracy/specificity tradeoff at training time, with a fixed structure. More recently (Wu, Tygert, and LeCun 2017) use a hierarchical loss to increase the penalty for distant branches of a label hierarchy using the ultrametric tree distance. We also aim to capture the most important aspects of the symbolic structure and shape our loss function accordingly, however our loss shaping is a result of discrete optimization and incorporates a learnability heuristic to choose aspects that can easily be acquired.

A different direction for integrating structure stems from constraining model outputs, or enforcing a grammar. In the work of (Ling, Singh, and Weld 2015), the authors use NER and FIGER types to ensure that an EL model follows the constraints given by types. We also use a type system and constrain our model’s output, however our type system is task-specific and designed by a machine with a disam- biguation accuracy objective, and unlike the authors we find that types improve accuracy. The work of (Krishnamurthy, Dasigi, and Gardner 2017) uses a type-aware grammar to constrain the decoding of a neural semantic parser. Our work makes use of type constraints during decoding, however the grammar and types in their system require human engineering to fit each individual semantic parsing task, while our type systems are based on online encyclopaedias and ontologies, with applications beyond EL.

Neural Entity Linking Current approaches to entity linking make extensive use of deep neural networks, distributed representations. In (Globerson et al. 2016) a neural net- work uses attention to focus on contextual entities to dis- ambiguate. While our work does not make use of attention, RNNs allow context information to affect disambiguation decisions. In the work of (Yamada et al. 2016) and (Yamada et al. 2017), the authors adopt a distributed representation of context which either models words and entities, or documents and entities such that distances between vectors in- forms disambiguation. We also rely on word and document vectors produced by RNNs, however entities are not explicitly represented in our neural network, and we use context to predict entity types, thereby allowing us to incorporate new entities without retraining.

6        Conclusion

In this work we introduce DeepType, a method for integrating symbolic knowledge into the reasoning process of a neural network. We’ve proposed a mixed integer reformulation for jointly designing type systems and training a classifier for a target task, and empirically validated that when this technique is applied to EL it is effective at integrating symbolic information in the neural network reasoning process. When pre-training with DeepType for NER, we observe improved performance over baselines and a new state of the art on the OntoNotes dev set, suggesting there is cross-domain transfer: symbolic information is incorporated in the neural network’s distributed representation. Furthermore we find that type systems designed by machines outperform those designed by humans on three benchmark datasets, which is attributable to incorporating learnability and target task performance goals within the design process. Our approach naturally enables multilingual training, and our experiments show that bilingual training improves over monolingual, and type systems optimized for English operate at similar accu- racies in French, German, and Spanish, supporting the claim that the type system optimization leads to the discovery of high level cross-lingual concepts useful for knowledge representation. We compare different search techniques, and observe that stochastic optimization has comparable performance to heuristic search, but with orders of magnitude less objective function evaluations.

The main contributions of this work are a joint formulation for designing and integrating symbolic information into neural networks, that enable us to constrain the outputs to obey symbolic structure, and an approach to EL that uses type constraints. Our approach reduces EL resolution complexity from O(N 2) to O(N ), while allowing new entities to be incorporated without retraining, and we find on three standard datasets (WikiDisamb30, CoNLL (YAGO), TAC KBP 2010) that our approach outperforms all existing solutions by a wide margin, including approaches that rely on a human-designed type system (Ling, Singh, and Weld 2015) and the more recent work by Yamada et al. for embed- ding words and entities (Yamada et al. 2016), or document and entities (Yamada et al. 2017). As a result of our experiments, we observe that disambiguation accuracy using Oracles reaches 99.0% on CoNLL (YAGO) and 98.6% on TAC KBP 2010, suggesting that EL would be almost solved if we can close the gap between type classifiers and the Oracle.

The results presented in this work suggest many directions for future research: we may test how DeepType can be applied to other problems where incorporating symbolic structure is beneficial, whether making type system design more expressive by allowing hierarchies can help close the gap between model and Oracle accuracy, and seeing if additional gains can be obtained by relaxing the classifier’s con- ditional independence assumption.

Acknowledgments We would like to thank the anonymous reviewers for their valuable feedback. In addition, we thank John Miller, Andrew Gibiansky, and Szymon Sidor for thoughtful comments and fruitful discussion.


A         Training details and hyperparameters


Our models are implemented in Tensorflow and optimized with Adam with a learning rate of 104, β1 = 0.9, β2 = 0.999, c = 108, annealed by 0.99 every 10,000 iterations.

To reduce over-fitting and make our system more robust to spelling changes we apply Dropout to input embeddings and augment our data with noise: swap input words with a special <UNK> word, remove capitalization or a trailing “s.” In our NER experiments we add Gaussian noise during training to the LSTM weights with σ = 106.

We use early stopping in our NER experiments when validation F1 score stops increasing. Type classification model selection is different as the models did not overfit, thus we instead stop training when no more improvements in F1 are observed on held-out type-training data (∼ 3 days on one Titan X Pascal).


Character representation Our character-convolutions have          character filters      with        (width,    channels): {(1, 50), (2, 75), (3, 75), (4, 100), (5, 200), (6, 200), (7, 200)}, a maximum word length of 40, and 15-dimensional char- acter embeddings followed by 2 highway layers. We learn 6-dimensional embeddings for 2 and 3 character prefixes and suffixes.

Table 2: Hyperparameters for type system discovery search.

Greedy Beam Searchb b1 8
      CEMMCEM pstart NCEM1000  50  ≈ 0.00115 |R| 200
mutation probability0.5
crossover probability0.2

Text Window Classifier The text window classifiers have 5-dimensional word embeddings, and use Dropout of 0.5. Empirically we find that two passes through the dataset with a batch size of 128 is sufficient for the window classifiers to converge. Additionally we train multiple type axes in a single batch, reaching a training speed of 2.5 type axes/second.

B         Wikipedia Link Simplification

Link statistics collected on large corpuses of entity mentions are extensively used in entity linking. These statistics pro- vide a noisy estimate of the conditional probability of an entity e for a mention m P(e|m). Intra-wiki links in Wikipedia provide a multilingual and broad coverage source of links, however annotators often create link anaphoras: “king” → Charles I of England. This behavior increases polysemy (“king” mention has 974 associated entities) and distorts link frequencies (“queen” links to the band Queen 4920 times, Elizabeth II 1430 times, and monarch only 32 times).

Problems with link sparsity or anaphora were previously identified, however present solutions rely on pruning rare links and thus lose track of the original statistics (Ferrag- ina and Scaiella 2010; Hasibi, Balog, and Bratsberg 2016; Ling, Singh, and Weld 2015). We propose instead to detect anaphoras and recover the generic meaning through the Wikidata property graph: if a mention points to entities A and B, with A being more linked than B, and A is B’s parent in the Wikidata property graph, then re- place B with A. We define A to be the parent of B if they connect through a sequence of Wikidata properties {instance of, subclass of, is a list of}, or through a single edge in {occupation, position held, series16}. The simplification process is repeated until no more updates occur. This transformation reduces the number of associated entities for each mention (“king” senses drop from 974 to 143) and ensures that the semantics

Table 3: Link change statistics per iteration during English Wikipedia Anaphora Simplification.

of multiple specific links are aggregated (number of “queen” links to monarch increase from 32 to 3553).

Figure 4: Mention Polysemy change after simplification.

After simplification we find that the mean number of senses attached to polysemous mentions drops from 4.73 to 3.93, while over 10,670,910 links undergo changes in this process (Figure 4). Table 3 indicates that most changes result from mentions containing entities and their immediate parents. This simplification method strongly reduces the number of entities tied to each Wikipedia mention in an automatic fashion across multiple languages.

C Multilingual Training Representation

Multilingual data creation is a side-effect of the ontologybased automatic labeling scheme. In Table 4 we present nearest-neighbor words for words in multiple languages. We note that common words (he, Argentinian, hypothesis) remain close to their foreign language counterpart, while proper nouns group with country/language-specific terms. We hypothesize that common words, by not fulfilling a role as a label, can therefore operate in a language independent way to inform the context of types, while proper nouns will have different type requirements based on their labels, and thus will not converge to the same representation.

D Effect of System Size Penalty

We measure the effect of varying λ on type system discovery when using CEM for our search. The effect averaged on 10 trials for a variety of λ penalties is shown in Figure 6. In particular we notice that there is a crossover point in the performance characteristics when selecting λ, where a looser penalty has diminishing returns in accuracy around λ = 10−4

Figure 5: Effect of varying λ on CEM type system discovery

E         Learnability Heuristic behavior

To better understand the behavior of the population of classifiers used to obtain AUC scores for the Learnability heuristic we investigate whether certain type axes are systematically easier or harder to predict, and summarize our results in Figure 7. We find that type axes with a instance of edge have on average higher AUC scores than type axes relying on wikipedia category. Furthermore, we also wanted to ensure that our methodology for estimating learnability was not flawed or if variance in our measurement was correlated with AUC for a type axis. We find that there is no obvious relation between the standard deviation of the AUC scores for a type axis and the AUC score itself

F         Multilingual Part of Speech Tagging

Finally the usage of multilingual data allows some amount of subjective experiments. For instance in Figure 8 we show some samples from the model trained jointly on english and french correctly detecting the meaning of the word “car” across three possible meanings.

G        Human Type System

To assist humans with the design of the system, the rules are built interactively in a REPL, and execute over the 24 million entities in under 10 seconds, allowing for real time feedback in the form of statistics or error analysis over an evaluation corpus. On the evaluation corpus, disambiguation mistakes can be grouped according to the ground truth type, allowing a per type error analysis to easily detect areas where more granularity would help. Shown below are the 5 different type axes designed by humans.


[Abu-El-Haija et al. 2016] Abu-El-Haija, S.; Kothari, N.; Lee, J.; Natsev, P.; Toderici, G.; Varadarajan, B.; and Vi- jayanarasimhan, S. 2016. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675.

[Andor et al. 2016] Andor, D.; Alberti, C.; Weiss, D.; Sev- eryn, A.; Presta, A.; Ganchev, K.; Petrov, S.; and Collins,

M. 2016. Globally normalized transition-based neural net- works. arXiv preprint arXiv:1603.06042.

Table 4: Top-k Nearest neighbors (cosine distance) in shared English-French word vector space.

1argentin (0.259)he (0.333)Bayron (0.395)Reagan (0.413)paradox (0.388)
2Argentina (0.313)il (0.360)Peillon (0.409)Trump (0.441)Hypothesis (0.459)
3Argentine (0.315)him (0.398)Montebourg (0.419)Cheney (0.495)hypothe`se (0.497)

Table 5: Additional set of Top-k Nearest neighbors (cosine distance) in shared English-French word vector space.

1killing (0.585)Computer (0.384)
2terrible (0.601)computers (0.446)
3beings (0.618)informatique (0.457)

[Chiu and Nichols 2015] Chiu, J. P., and Nichols, E. 2015. Named entity recognition with bidirectional lstm-cnns. arXiv preprint arXiv:1511.08308.

[Deng et al. 2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hi- erarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, 248–255. IEEE.

[Deng et al. 2012] Deng, J.; Krause, J.; Berg, A. C.; and Fei- Fei, L. 2012. Hedging your bets: Optimizing accuracy- specificity trade-offs in large scale visual recognition. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, 3450–3457. IEEE.

[Ferragina and Scaiella 2010] Ferragina, P., and Scaiella, U. 2010. Tagme: on-the-fly annotation of short text fragments (by wikipedia entities). In Proceedings of the 19th ACM in- ternational conference on Information and knowledge man- agement, 1625–1628. ACM.

[Globerson et al. 2016] Globerson, A.; Lazic, N.; Chakrabarti, S.; Subramanya, A.; Ringaard, M.; and Pereira, F. 2016. Collective entity resolution with multi- focal attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, 621–631.

[Harvey 2009] Harvey, I. 2009. The microbial genetic algo- rithm. In European Conference on Artificial Life, 126–133. Springer.

[Hasibi, Balog, and Bratsberg 2016] Hasibi, F.; Balog, K.; and Bratsberg, S. E. 2016. On the reproducibility of the tagme entity linking system. In European Conference on Information Retrieval, 436–449. Springer.

[Hoffart et al. 2011] Hoffart, J.; Yosef, M. A.; Bordino, I.; Fu¨rstenau, H.; Pinkal, M.; Spaniol, M.; Taneva, B.; Thater, S.; and Weikum, G. 2011. Robust Disambiguation of Named Entities in Text. In Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, Edinburgh, Scotland, 782–792.

[Ji et al. 2010] Ji, H.; Grishman, R.; Dang, H. T.; Griffitt, K.; and Ellis, J. 2010. Overview of the tac 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), volume 3, 3–3.

[Kay et al. 2017] Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950.

[Kim et al. 2015] Kim, Y.; Jernite, Y.; Sontag, D.; and Rush, A. M. 2015. Character-aware neural language models. arXiv preprint arXiv:1508.06615.

[Krishnamurthy, Dasigi, and Gardner 2017] Krishnamurthy, J.; Dasigi, P.; and Gardner, M. 2017. Neural semantic parsing with type constraints for semi-structured tables. In EMNLP, volume 17, 1532–1543.

[Lample et al. 2016] Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; and Dyer, C. 2016. Neural a chitectures for named entity recognition. arXiv preprint arXiv:1603.01360.

[Ling and Weld 2012] Ling, X., and Weld, D. S. 2012. Fine- grained entity recognition. In AAAI. Citeseer.

[Ling, Singh, and Weld 2015] Ling, X.; Singh, S.; and Weld, D. S. 2015. Design challenges for entity linking. Trans- actions of the Association for Computational Linguistics 3:315–328.

[Milne and Witten 2008] Milne, D., and Witten, I. H. 2008. Learning to link with wikipedia. In Proceedings of the 17th ACM conference on Information and knowledge manage- ment, 509–518. ACM.

[Pradhan et al. 2012] Pradhan, S.; Moschitti, A.; Xue, N.; Uryupina, O.; and Zhang, Y. 2012. Conll-2012 shared task: Modeling multilingual unrestricted coreference in ontonotes. In EMNLP-CoNLL Shared Task.

[Rubinstein 1999] Rubinstein, R. 1999. The cross- entropy method for combinatorial and continuous optimiza- tion. Methodology and computing in applied probability 1(2):127–190.

[Sang and Meulder 2003] Sang, E. F. T. K., and Meulder, F. D. 2003. Introduction to the conll-2003 shared task: Language-independent named entity recognition. In CoNLL.

[Wu, Tygert, and LeCun 2017] Wu, C.; Tygert, M.; and Le-

Figure 6: Effect of varying λ on CEM type system discovery: Solution size (a) and iterations to convergence (b) grow exponentially with penalty decrease, while accuracy plateaus (c) around λ = 10−4. Objective function increases as penalty decreases, since solution size is less penalized (d). Standard deviation is shown as the red region around the mean.
Figure 7: Most instance of type-axes have higher AUC scores than wikipedia categories (a). The standard deviation for AUC scoring with text window classifiers is below 0.1 (b), AUC is not correlated with AUC’s standard deviation.
This is     a  car  , ceci n’  est pas une voiture car c’  est un car  .
Figure 8: Model trained jointly on monolingual POS corpora detecting the multiple meanings of “car” (shown in bold) in a mixed English-French sentence.

Cun, Y. 2017. Hierarchical loss for classification. arXiv preprint arXiv:1709.01062.

[Yamada et al. 2016] Yamada, I.; Shindo, H.; Takeda, H.; and Takefuji, Y. 2016. Joint learning of the embedding of words and entities for named entity disambiguation. arXiv preprint arXiv:1601.01343.

[Yamada et al. 2017] Yamada, I.; Shindo, H.; Takeda, H.; and Takefuji, Y. 2017. Learning distributed representations of texts and entities from knowledge base. arXiv preprint arXiv:1705.02494.

Table 6: Human Type Axis: IsA

Anatomical structure
Astronomical object
Audio visual work
Award ceremony Battle
Book magazine article
Chemical compound
Data format
Developmental biology period
Electromagnetic wave Event
Fictional character
Geographical object
Geometric shape
Human female
Human male
International relations

Table 7: Human Type Axis: IsA (continued)

Legal action
Legal case
Legislative term
Mathematical object
Musical work
Natural phenomenon
Other art work
Person role
Physical object
Physical quantity
Populated place
Postal code
Radio program
Record chart
Road vehicle
Sexual orientation
Sport event
Sports terminology
Taxonomic rank
Train station
Unit of mass

Table 8: Human Type Axis: IsA (continued)

Vehicle brand

Table 9: Human Type Axis: Topic

Automotive industry
Business other
Culture arts other
Culture ceramic art
Culture circus
Culture literature
Health cell
Health childbrith
Health drug
Health gene
Health hospital
Health human gene
Health insurance
Health life insurance
Health medical
Health med activism
Health med doctors
Health med society
Health organisations
Health people in health
Health pharma
Health protein

Table 10: Human Type Axis: Topic (continued)

Health protein wkp
Health science medicine
Heavy industry
Hortculture and gardening
Military war crime
Populated places
Retail other
Science other
Science-computer science
Science-social science other
Science chronology
Science histology
Science meteorology
Sex industry
Sport-american football
Sport-australian football
Sport-combat sport
Sport-cue sport
Sport-equestrian sport
Sport-field hockey

Table 11: Human Type Axis: Topic (continued)

Sport-ice hockey
Sport-mind sport
Sport-motor sport
Sport-racquet sport
Sport-winter sport
Sport water sport

Table 12: Human Type Axis: Time


Table 13: Human Type Axis: Location

Middle East
North America
Outer Space
Populated place unlocalized
South America