1 00:00:05,961 --> 00:00:08,133 (moderator) The next talk is by Anders Sandholm 2 00:00:08,133 --> 00:00:12,319 on Wikidata fact annotation for Wikipedia across languages. 3 00:00:12,319 --> 00:00:13,920 - Thank you. - Thanks. 4 00:00:21,905 --> 00:00:24,164 I wanted to start with a small confession. 5 00:00:26,428 --> 00:00:31,687 Wow! I'm blown away by the momentum of Wikidata 6 00:00:33,799 --> 00:00:35,909 and the engagement of the community. 7 00:00:37,230 --> 00:00:38,670 I am really excited about being here 8 00:00:38,671 --> 00:00:42,296 and getting a chance to talk about work that we've been doing. 9 00:00:42,914 --> 00:00:47,398 This is doing work with Michael, who's also here in the third row. 10 00:00:49,551 --> 00:00:51,921 But before I dive more into this, 11 00:00:51,922 --> 00:00:55,515 this wouldn't be a Google presentation without an ad, 12 00:00:56,102 --> 00:00:58,196 so you get that up front. 13 00:00:58,196 --> 00:01:01,242 This is what I'll be talking about, our project, the SLING project. 14 00:01:02,255 --> 00:01:06,640 It is an open source project and it's using Wikidata a lot. 15 00:01:08,020 --> 00:01:11,721 You can go check it out on GitHub when you get a chance 16 00:01:11,722 --> 00:01:15,960 if you feel excited about it after the presentation. 17 00:01:18,215 --> 00:01:23,493 And really, what I wanted to talk about-- the title is admittedly a little bit long, 18 00:01:23,494 --> 00:01:25,797 it's even shorter than it was in the original program. 19 00:01:25,798 --> 00:01:29,704 But what it comes down to, what the project comes down to 20 00:01:29,704 --> 00:01:33,617 is trying to answer this one very exciting question. 21 00:01:34,810 --> 00:01:38,218 If you want, in the beginning, there were just two files, 22 00:01:39,914 --> 00:01:41,400 some of you may recognize them, 23 00:01:42,416 --> 00:01:45,953 they're essentially the dump files from Wikidata and Wikipedia, 24 00:01:47,234 --> 00:01:50,280 and the question we're trying to figure out or answer is really, 25 00:01:51,570 --> 00:01:54,423 can we dramatically improve how good machines are 26 00:01:54,424 --> 00:01:58,062 at understanding human language just by using these files? 27 00:02:00,900 --> 00:02:04,158 And of course, you're entitled to ask 28 00:02:04,158 --> 00:02:06,191 whether that's an interesting question to answer. 29 00:02:07,450 --> 00:02:14,344 If you're a company that [inaudible] is to be able to take search queries 30 00:02:14,344 --> 00:02:17,656 and try to answer them in the best possible way, 31 00:02:18,460 --> 00:02:23,989 obviously, understanding natural language comes in as a very handy thing. 32 00:02:25,317 --> 00:02:27,914 But even if you look at Wikidata, 33 00:02:29,109 --> 00:02:33,843 in the previous data quality panel earlier today, 34 00:02:33,843 --> 00:02:39,070 there was a question that came up about verification, or verifiability of facts. 35 00:02:39,070 --> 00:02:42,623 So let's say you actually do understand natural language. 36 00:02:42,623 --> 00:02:47,304 If you have a fact and there's a source, you could go to the source and analyze it, 37 00:02:47,304 --> 00:02:49,721 and you can figure out whether it actually confirms the fact 38 00:02:49,722 --> 00:02:52,282 that is claiming to have this as a source. 
39 00:02:53,459 --> 00:02:55,540 And if you could do it, you could even go beyond that 40 00:02:55,541 --> 00:02:59,723 and you could read articles and annotate them, come up with facts, 41 00:02:59,723 --> 00:03:03,478 and actually look for existing facts that may need sources 42 00:03:03,479 --> 00:03:06,109 and add these articles as sources. 43 00:03:07,110 --> 00:03:11,371 Or, you know, in the wildest, craziest possible of all worlds, 44 00:03:11,371 --> 00:03:13,756 if you get really, really good at it you could read articles 45 00:03:13,756 --> 00:03:18,243 and maybe even annotate with new facts that you could then suggest as facts 46 00:03:18,244 --> 00:03:19,965 that you could potentially add to Wikidata. 47 00:03:20,595 --> 00:03:27,025 But there's a whole world of applications of natural language understanding. 48 00:03:28,895 --> 00:03:32,478 One of the things that's really hard when you do natural language understanding-- 49 00:03:32,479 --> 00:03:35,595 these days, that also means deep learning or machine learning, 50 00:03:35,596 --> 00:03:39,537 and one of the things that's really hard is getting enough training data. 51 00:03:39,537 --> 00:03:42,812 And historically, that's meant having a lot of text 52 00:03:42,812 --> 00:03:45,441 that you need human annotators to then first process 53 00:03:45,442 --> 00:03:46,801 and then you can do training. 54 00:03:46,802 --> 00:03:51,184 And part of the question here is also really to say: 55 00:03:51,184 --> 00:03:55,930 Can we use Wikidata and the way in which it's interlinked with Wikipedia 56 00:03:57,012 --> 00:03:58,012 for training data, 57 00:03:58,013 --> 00:04:00,600 and will that be enough to train that model? 58 00:04:03,429 --> 00:04:06,517 So hopefully, we'll get closer to answering this question 59 00:04:06,518 --> 00:04:09,289 in the next 15 to 20 minutes. 60 00:04:10,271 --> 00:04:14,071 We don't quite know the answer yet but we have some exciting results 61 00:04:14,072 --> 00:04:16,992 that are pointing in the right direction, if you want. 62 00:04:19,387 --> 00:04:23,798 Just take a step back in terms of the development we've seen, 63 00:04:24,445 --> 00:04:28,450 machine learning and deep learning has revolutionized a lot of areas 64 00:04:28,450 --> 00:04:32,431 and this is just one example of a particular image recognition task 65 00:04:32,432 --> 00:04:37,343 that if you look at what happened between 2010 and 2015, 66 00:04:37,344 --> 00:04:40,881 in that five-year period, we went from machines doing pretty poorly 67 00:04:40,882 --> 00:04:44,921 to, in the end, actually performing at the same level of humans 68 00:04:44,922 --> 00:04:48,804 or in some cases even better albeit for a very specific task. 69 00:04:50,224 --> 00:04:55,515 So we've seen really a lot of things improving dramatically. 70 00:04:56,221 --> 00:04:57,881 And so you can ask 71 00:04:57,882 --> 00:05:02,440 why don't we just throw deep learning at natural language processing 72 00:05:02,440 --> 00:05:04,600 and natural language understanding and be done with it? 
73 00:05:05,497 --> 00:05:11,532 And the answer is kind of we've sort of done to a certain extent, 74 00:05:11,532 --> 00:05:14,367 but what it turns out is that 75 00:05:15,005 --> 00:05:17,725 natural language understanding is actually still a bit of a challenge 76 00:05:17,726 --> 00:05:23,281 and one of the situations where a lot of us interact with machines 77 00:05:23,282 --> 00:05:25,803 that are trying to behave like they understand what we're saying 78 00:05:25,804 --> 00:05:26,804 is in these chat bots. 79 00:05:26,805 --> 00:05:28,605 So this is not to pick on anyone in particular 80 00:05:28,606 --> 00:05:31,991 but just, I think, an experience that a lot of us have had. 81 00:05:31,992 --> 00:05:36,841 In this case, it's a user saying I want to stay in this place. 82 00:05:36,842 --> 00:05:41,766 The chat bot says: "OK, got it, when will you be checking in and out? 83 00:05:41,766 --> 00:05:44,488 For example, November 17th to 23rd." 84 00:05:44,488 --> 00:05:46,620 And the user says: "Well, I don't have any dates yet." 85 00:05:46,620 --> 00:05:47,681 And then the response is: 86 00:05:47,682 --> 00:05:51,050 "Sorry, there are no hotels available for the dates you've requested. 87 00:05:51,050 --> 00:05:52,571 Would you like to start a new search?" 88 00:05:53,212 --> 00:05:55,041 So there's still some way to go 89 00:05:55,862 --> 00:05:58,755 to get machines to really understand human language. 90 00:05:59,817 --> 00:06:03,761 But machine learning or deep learning 91 00:06:03,762 --> 00:06:06,786 has been applied already to this discipline. 92 00:06:06,787 --> 00:06:09,721 Like, one of the examples is a recent... 93 00:06:09,722 --> 00:06:11,232 a more successful example is BERT 94 00:06:11,233 --> 00:06:17,316 where they're using transformers to solve NLP or NLU tasks. 95 00:06:18,800 --> 00:06:22,157 And it's dramatically improved the performance but, as we've seen, 96 00:06:22,157 --> 00:06:23,560 there is still some way to go. 97 00:06:25,150 --> 00:06:27,857 One thing that's shared among most of these approaches 98 00:06:27,858 --> 00:06:31,785 is that you look at the text itself 99 00:06:31,785 --> 00:06:36,629 and you depend on having a lot of it so you can train your model on the text, 100 00:06:36,629 --> 00:06:39,761 but everything is based on just looking at the text 101 00:06:39,762 --> 00:06:41,675 and understanding the text. 102 00:06:41,675 --> 00:06:45,727 So the learning is really just representation learning. 103 00:06:45,727 --> 00:06:50,653 What we wanted to do is actually understand and annotate the text 104 00:06:50,653 --> 00:06:54,006 in terms of items or entities in the real world. 105 00:06:56,384 --> 00:06:59,537 And in general, if we take a step back, 106 00:07:00,077 --> 00:07:03,441 why is natural language processing or understanding so hard? 107 00:07:03,442 --> 00:07:07,659 There are a number of reasons why it's really hard, but at the core, 108 00:07:07,659 --> 00:07:11,041 one of the important reasons is that somehow, 109 00:07:11,042 --> 00:07:13,225 the machine needs to have knowledge of the world 110 00:07:13,226 --> 00:07:16,867 in order to understand human language. 111 00:07:19,569 --> 00:07:22,456 And you think about that for a little while. 112 00:07:23,074 --> 00:07:26,654 What better place to look for knowledge about the world than Wikidata? 113 00:07:27,318 --> 00:07:29,625 So in essence, that's the approach. 
114 00:07:29,625 --> 00:07:31,985 And the question is can you leverage it, 115 00:07:31,985 --> 00:07:38,877 can you use this wonderful knowledge 116 00:07:38,878 --> 00:07:40,601 of the world that we already have 117 00:07:40,602 --> 00:07:45,617 in a way that you can help to train and bootstrap your model. 118 00:07:47,390 --> 00:07:51,121 So the alternative here is really understanding the text 119 00:07:51,122 --> 00:07:55,439 not just in terms of other texts or how this text is similar to other texts 120 00:07:55,439 --> 00:07:59,104 but in terms of the existing knowledge that we have about the world. 121 00:08:01,164 --> 00:08:02,704 And what makes me really excited 122 00:08:02,705 --> 00:08:05,905 or at least makes me have a good gut feeling about this 123 00:08:05,906 --> 00:08:07,372 is that in some ways 124 00:08:07,373 --> 00:08:10,780 it seems closer to how we interact as humans. 125 00:08:10,780 --> 00:08:13,795 So if we were having a conversation 126 00:08:13,795 --> 00:08:17,847 and you were bringing up the Bundeskanzler and Angela Merkel, 127 00:08:18,662 --> 00:08:23,173 I would have an internal representation of Q567 and it would light up. 128 00:08:23,173 --> 00:08:25,521 And in our continued conversation, 129 00:08:25,522 --> 00:08:29,615 mentioning other things related to Angela Merkel, 130 00:08:29,616 --> 00:08:31,762 I would have an easier time associating with that 131 00:08:31,763 --> 00:08:33,920 or figuring out what you were actually talking about. 132 00:08:35,027 --> 00:08:38,919 And so, in essence, that's at the heart of this approach, 133 00:08:38,919 --> 00:08:42,100 that we really believe Wikidata is a key component 134 00:08:42,101 --> 00:08:45,809 in unlocking this better understanding of natural language. 135 00:08:49,732 --> 00:08:51,448 And so how are we planning to do it? 136 00:08:52,557 --> 00:08:56,797 Essentially, there are five steps we're going through, 137 00:08:56,798 --> 00:08:58,080 or have been going through. 138 00:08:58,788 --> 00:09:02,841 I'll go over each of the steps briefly in turn 139 00:09:02,841 --> 00:09:04,410 but essentially, there are five steps. 140 00:09:04,410 --> 00:09:07,120 First, we need to start with the dump files that I showed you 141 00:09:07,120 --> 00:09:08,120 to begin with-- 142 00:09:08,706 --> 00:09:11,149 understanding what's in them, parsing them, 143 00:09:11,149 --> 00:09:13,397 having an efficient internal representation in memory 144 00:09:13,397 --> 00:09:15,716 that allows us to do quick processing on this. 145 00:09:16,225 --> 00:09:18,502 And then, we're leveraging some of the annotations 146 00:09:18,503 --> 00:09:22,605 that are already in Wikipedia, linking it to items in Wikidata. 147 00:09:22,605 --> 00:09:25,462 I'll briefly show you what I mean by that. 148 00:09:25,462 --> 00:09:31,001 We can use that to then generate more advanced annotations 149 00:09:31,973 --> 00:09:34,549 where we have much more text annotated. 150 00:09:34,549 --> 00:09:40,333 But still, with annotations being items or facts in Wikidata, 151 00:09:40,334 --> 00:09:43,717 we can then train a model based on the silver data 152 00:09:43,717 --> 00:09:46,212 and get a reasonably good model 153 00:09:46,212 --> 00:09:49,047 that will allow us to read a Wikipedia document 154 00:09:49,047 --> 00:09:53,308 and understand what the actual content is in terms of Wikidata, 155 00:09:54,613 --> 00:09:57,580 but only for facts that are already in Wikidata. 
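To make the second and third of those steps concrete, here is a minimal Python sketch of the idea, purely illustrative and not the actual SLING code: the helper names are made up and the second QID is a placeholder. Anchor links harvested from the Wikipedia dump give phrase-to-QID pairs, their counts form a phrase table, and the phrase table is then used to annotate unlinked mentions with the most likely item.

```python
from collections import Counter, defaultdict

# Illustrative sketch only -- not the SLING implementation.
# Assumes (phrase, qid) pairs have already been harvested from the
# [[target|anchor text]] links in the Wikipedia dump.
def build_phrase_table(link_pairs):
    """Map each anchor phrase to a frequency distribution over QIDs."""
    table = defaultdict(Counter)
    for phrase, qid in link_pairs:
        table[phrase.lower()][qid] += 1
    return table

def most_likely_item(phrase_table, phrase):
    """Return the QID most often linked from this phrase, if any."""
    counts = phrase_table.get(phrase.lower())
    return counts.most_common(1)[0][0] if counts else None

def silver_annotate(mentions, phrase_table):
    """Attach QIDs to plain-text mentions using the phrase table."""
    return {m: most_likely_item(phrase_table, m) for m in mentions}

# Toy usage: two anchors seen in the dump, then unlinked mentions.
pairs = [("Angela Merkel", "Q567"), ("Angela Merkel", "Q567"),
         ("CDU", "Q_CDU")]  # Q_CDU is a placeholder QID for illustration
table = build_phrase_table(pairs)
print(silver_annotate(["Angela Merkel", "CDU"], table))
# -> {'Angela Merkel': 'Q567', 'CDU': 'Q_CDU'}
```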
156 00:09:58,523 --> 00:10:02,367 And so that's where kind of the hard part of this begins. 157 00:10:02,367 --> 00:10:06,100 In order to go beyond that we need to have a plausibility model, 158 00:10:06,100 --> 00:10:07,641 so a model that can tell us, 159 00:10:07,642 --> 00:10:10,881 given a lot of facts about an item and an additional fact, 160 00:10:10,882 --> 00:10:12,627 whether the additional fact is plausible. 161 00:10:13,191 --> 00:10:14,296 If we can build that, 162 00:10:14,892 --> 00:10:21,831 we can then use a more "hyper modern" reinforcement learning aspect 163 00:10:21,832 --> 00:10:26,033 of deep learning and machine learning to fine-tune the model 164 00:10:26,033 --> 00:10:30,303 and hopefully go beyond what we've been able to so far. 165 00:10:31,933 --> 00:10:32,933 So real quick, 166 00:10:32,934 --> 00:10:36,632 the first step is essentially getting the dump files parsed, 167 00:10:36,632 --> 00:10:41,021 understanding the contents, and linking up Wikidata and Wikipedia information, 168 00:10:41,022 --> 00:10:44,416 and then utilizing some of the annotations that are already there. 169 00:10:45,547 --> 00:10:49,304 And so this is essentially what's happening. 170 00:10:49,305 --> 00:10:51,959 Trust me, Michael built all of this, it's working great. 171 00:10:52,701 --> 00:10:55,621 But essentially, we're starting with the two files you can see on the top, 172 00:10:55,622 --> 00:10:58,244 the Wikidata dump and the Wikipedia dump. 173 00:10:58,245 --> 00:11:02,413 The Wikidata dump gets processed and we end up with a knowledge base, 174 00:11:02,413 --> 00:11:04,376 a KB at the bottom. 175 00:11:04,377 --> 00:11:07,335 That's essentially a store we can hold in memory 176 00:11:07,336 --> 00:11:10,439 that has essentially all of Wikidata in it 177 00:11:10,440 --> 00:11:13,841 and we can quickly access all the properties and facts and so on 178 00:11:13,841 --> 00:11:15,163 and do analysis there. 179 00:11:15,164 --> 00:11:16,414 Similarly, for the documents, 180 00:11:16,415 --> 00:11:18,486 they get processed and we end up with documents 181 00:11:19,274 --> 00:11:21,911 that have been processed. 182 00:11:21,912 --> 00:11:23,544 We know all the mentions 183 00:11:23,545 --> 00:11:26,838 and some of the things that are already in the documents. 184 00:11:26,839 --> 00:11:27,839 And then in the middle, 185 00:11:27,840 --> 00:11:30,093 we have an important part which is a phrase table 186 00:11:30,094 --> 00:11:33,081 that allows us to basically see for any phrase 187 00:11:34,096 --> 00:11:35,753 what is the frequency distribution, 188 00:11:35,754 --> 00:11:39,481 what's the most likely item that we're referring to 189 00:11:39,481 --> 00:11:41,165 when we're using this phrase. 190 00:11:41,165 --> 00:11:44,445 So we're using that later on to build the silver annotations. 191 00:11:44,446 --> 00:11:48,001 So let's say we've run this and then we also want to make sure 192 00:11:48,002 --> 00:11:51,691 we utilize annotations that are already there. 193 00:11:51,692 --> 00:11:54,112 So an important part of a Wikipedia article 194 00:11:54,113 --> 00:11:57,841 is that it's not just plain text, 195 00:11:57,842 --> 00:12:01,007 it's actually already pre-annotated with a few things. 196 00:12:01,008 --> 00:12:04,046 So a template is one example, links is another example. 197 00:12:04,047 --> 00:12:08,017 So if we take here the English article for Angela Merkel, 198 00:12:09,387 --> 00:12:12,301 there is one example of a link here which is to her party. 
199 00:12:12,302 --> 00:12:13,772 If you look at the bottom, 200 00:12:13,773 --> 00:12:16,426 that's a link to a specific Wikipedia article, 201 00:12:16,427 --> 00:12:20,155 and I guess for people here, it's no surprise that, in essence, 202 00:12:20,156 --> 00:12:23,360 that is then, if you look at the associated Wikidata item, 203 00:12:23,361 --> 00:12:25,801 that's essentially an annotation saying 204 00:12:25,802 --> 00:12:31,453 this is the QID I am talking about when I'm talking about this party, 205 00:12:31,453 --> 00:12:32,820 the Christian Democratic Union. 206 00:12:33,951 --> 00:12:37,281 So we're using this to already have a good start 207 00:12:37,282 --> 00:12:39,326 in terms of understanding what text means. 208 00:12:39,327 --> 00:12:40,327 All of these links, 209 00:12:40,328 --> 00:12:43,983 we know exactly what the author means with the phrase 210 00:12:44,504 --> 00:12:47,040 in the cases where there are links to QIDs. 211 00:12:48,234 --> 00:12:53,303 We can use this and the phrase table to then try and take a Wikipedia document 212 00:12:53,304 --> 00:12:58,760 and fully annotate it with everything we know about already from Wikidata. 213 00:12:59,659 --> 00:13:02,753 And we can use this to train the first iteration of our model. 214 00:13:03,933 --> 00:13:04,933 (coughs) Excuse me. 215 00:13:04,934 --> 00:13:07,876 So this is exactly the same article, 216 00:13:08,400 --> 00:13:13,566 but now, after we've annotated it with silver annotations, 217 00:13:14,673 --> 00:13:18,441 and essentially, you can see all of the squares 218 00:13:18,442 --> 00:13:24,530 are places where we've been able to annotate with QIDs or with facts. 219 00:13:26,362 --> 00:13:30,681 This is just a screenshot of the viewer on the data, 220 00:13:30,682 --> 00:13:34,281 so you can have access to all of this information 221 00:13:34,282 --> 00:13:37,577 and see what's come out of the silver annotation. 222 00:13:37,577 --> 00:13:41,364 And it's important to say that there's no machine learning 223 00:13:41,365 --> 00:13:42,678 or anything involved here. 224 00:13:42,679 --> 00:13:46,007 All we've done, is sort of mechanically, with a few tricks, 225 00:13:46,515 --> 00:13:49,709 basically pushed information we already have from Wikidata 226 00:13:49,710 --> 00:13:52,760 onto the Wikipedia article. 227 00:13:53,328 --> 00:13:56,202 And so here, if you hover over "Chancellor of Germany" 228 00:13:56,202 --> 00:14:01,973 that is itself a Wikidata, that's referring to a Wikidata item, 229 00:14:01,974 --> 00:14:04,972 has a number of properties like "subclass of: Chancellor", 230 00:14:04,972 --> 00:14:08,658 "country: Germany", that again referring to subtext. 231 00:14:08,659 --> 00:14:11,732 And here, it also has the property "officeholder" 232 00:14:12,473 --> 00:14:15,496 which happens to be Angela Dorothea Merkel, 233 00:14:15,497 --> 00:14:17,051 which is also mentioned in the text. 234 00:14:17,052 --> 00:14:22,137 So there's really a full annotation linking up the contents here. 235 00:14:24,645 --> 00:14:27,429 But again, there is an important and unfortunate point 236 00:14:27,430 --> 00:14:31,563 about what we are able to and not able to do here. 237 00:14:31,564 --> 00:14:35,342 So what we are doing is pushing information we already have in Wikidata, 238 00:14:35,342 --> 00:14:40,169 so what we can't annotate here are things that are not in Wikidata. 
239 00:14:40,169 --> 00:14:41,681 So for instance, here, 240 00:14:41,682 --> 00:14:44,910 she was at some point appointed Federal Minister for Women and Youth 241 00:14:44,910 --> 00:14:48,713 and that alias or that phrase is not in Wikidata, 242 00:14:48,713 --> 00:14:54,000 so we're not able to make that annotation here in our silver annotations. 243 00:14:56,227 --> 00:14:59,943 That said, it's still... at least for me, 244 00:14:59,944 --> 00:15:02,625 it was pretty surprising to see how much you can actually annotate 245 00:15:02,626 --> 00:15:04,266 and how much information is already there 246 00:15:04,267 --> 00:15:08,877 when you combine Wikidata with a Wikipedia article. 247 00:15:08,878 --> 00:15:15,321 So what you can do is, once you have this, you know, millions of documents, 248 00:15:16,275 --> 00:15:20,240 you can train your parser based on the annotations that are there. 249 00:15:21,134 --> 00:15:26,968 And that's essentially a parser that has a number of components. 250 00:15:26,969 --> 00:15:30,481 Essentially, the text is coming in at the bottom and at the top, 251 00:15:30,482 --> 00:15:33,722 we have a transition-based frame semantic parser 252 00:15:33,723 --> 00:15:39,154 that then generates the annotations or these facts or references to the items. 253 00:15:40,617 --> 00:15:44,987 We built this and ran it on more classical corpora 254 00:15:44,987 --> 00:15:49,611 like [inaudible], which are more classical NLP corpora, 255 00:15:49,611 --> 00:15:53,800 but we want to be able to run this on the full Wikipedia corpora. 256 00:15:53,800 --> 00:15:57,201 So Michael has been rewriting this in C++ 257 00:15:57,202 --> 00:15:59,932 and we're able to really scale up performance 258 00:15:59,932 --> 00:16:01,101 of the parser trainer here. 259 00:16:01,102 --> 00:16:03,594 So it will be exciting to see exactly 260 00:16:03,595 --> 00:16:05,830 the results that are going to come out of that. 261 00:16:08,638 --> 00:16:10,263 So once that's in place, 262 00:16:10,264 --> 00:16:13,459 we have a pretty good model that's able to at least 263 00:16:13,459 --> 00:16:16,051 predict facts that are already known in Wikidata, 264 00:16:16,052 --> 00:16:18,790 but ideally, we want to move beyond that, 265 00:16:18,790 --> 00:16:20,703 and for that we need this plausibility model 266 00:16:20,704 --> 00:16:23,928 which in essence, you can think of it as a black box 267 00:16:23,929 --> 00:16:27,121 where you supply it with all of the known facts you have 268 00:16:27,122 --> 00:16:30,574 about a particular item and then you provide an additional fact. 269 00:16:31,412 --> 00:16:32,412 And by magic, 270 00:16:32,413 --> 00:16:36,948 the black box tells you how plausible is the additional fact that you're providing 271 00:16:36,949 --> 00:16:40,396 and how plausible it is that this particular fact holds for the item. 272 00:16:42,792 --> 00:16:43,792 And... 273 00:16:45,733 --> 00:16:48,582 I don't know if it's fair to say that it was much to our surprise, 274 00:16:48,582 --> 00:16:50,776 but at least, you can actually-- 275 00:16:50,776 --> 00:16:52,905 In order to train a model, you need, 276 00:16:52,905 --> 00:16:55,255 like we've seen earlier, you need a lot of training data 277 00:16:55,256 --> 00:16:57,880 and essentially, you can use Wikidata as training data.
278 00:16:57,881 --> 00:17:02,213 You serve it basically all the facts for a given item 279 00:17:02,213 --> 00:17:04,614 and then you mask or hold out one fact 280 00:17:04,615 --> 00:17:08,566 and then you provide that as a fact that it's supposed to predict. 281 00:17:09,238 --> 00:17:10,718 And just using this as training data, 282 00:17:10,719 --> 00:17:15,881 you can get a really, really good plausibility model, actually, 283 00:17:18,574 --> 00:17:21,675 to the extent that I was hoping one day to maybe be able to even use it 284 00:17:21,675 --> 00:17:27,527 for discovering what you could call accidental vandalism in Wikidata 285 00:17:27,528 --> 00:17:33,011 like a fact that's been added by accident and really doesn't look like it's... 286 00:17:33,012 --> 00:17:35,029 It doesn't fit with the normal topology 287 00:17:35,029 --> 00:17:38,621 of facts or knowledge in Wikidata, if you want. 288 00:17:41,058 --> 00:17:43,761 But in this particular setup, we need it for something else, 289 00:17:43,762 --> 00:17:46,738 namely for doing reinforcement learning 290 00:17:47,951 --> 00:17:50,805 so we can fine-tune the Wiki parser, 291 00:17:50,805 --> 00:17:54,034 and basically using the plausibility model as a reward function. 292 00:17:54,035 --> 00:17:59,576 So when you do the training, you try to parse a Wikipedia document 293 00:17:59,576 --> 00:18:01,871 [inaudible] in Wikipedia comes up with a fact 294 00:18:01,871 --> 00:18:04,281 and we check the fact on the plausibility model 295 00:18:04,282 --> 00:18:07,527 and use that as feedback or as a reward function 296 00:18:08,198 --> 00:18:09,601 in training the model. 297 00:18:09,602 --> 00:18:12,708 And the big question here is then can we learn to predict facts 298 00:18:12,709 --> 00:18:15,000 that are not already in Wikidata. 299 00:18:15,800 --> 00:18:22,300 And we hope and believe we can but it's still not clear. 300 00:18:22,879 --> 00:18:27,792 So this is essentially what we have been and are planning to do. 301 00:18:27,792 --> 00:18:31,223 There have been some surprisingly good results 302 00:18:31,224 --> 00:18:33,989 in terms of how far you can get with silver annotations 303 00:18:33,990 --> 00:18:35,720 and a plausibility model. 304 00:18:36,271 --> 00:18:40,081 But in terms of how far we are, if you want, 305 00:18:40,082 --> 00:18:41,961 we sort of have the infrastructure in place 306 00:18:41,962 --> 00:18:44,480 to do the processing and have everything efficiently in memory. 307 00:18:45,121 --> 00:18:49,138 We have first instances of silver annotations 308 00:18:49,139 --> 00:18:53,041 and have a parser trainer in place for the supervised learning 309 00:18:53,042 --> 00:18:55,755 and an initial plausibility model. 310 00:18:55,756 --> 00:19:00,400 But we're still pushing on those fronts and very much looking forward 311 00:19:00,400 --> 00:19:03,320 to seeing what comes out of the very last bit. 312 00:19:07,786 --> 00:19:10,309 And those were my words. 313 00:19:10,310 --> 00:19:14,681 I'm very excited to see what comes out of it 314 00:19:14,682 --> 00:19:17,661 and it's been pure joy to work with Wikidata. 315 00:19:17,662 --> 00:19:19,513 It's been fun to see 316 00:19:19,514 --> 00:19:23,917 how some of the things you come across seemed wrong and then the next day, 317 00:19:23,918 --> 00:19:24,958 you look, things are fixed 318 00:19:24,959 --> 00:19:30,551 and it's really been amazing to see the momentum there. 319 00:19:31,161 --> 00:19:35,295 Like I said, the URL, all the source code is on GitHub.
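Since the source code is on GitHub, readers may want to follow the hold-one-out recipe for the plausibility model in code. The sketch below is an illustrative toy, not the project's actual implementation; the tiny knowledge base just reuses a few real Wikidata statements as an example.

```python
import random

# Illustrative sketch of the hold-one-out training data described above;
# not the project's actual code. The toy knowledge base maps a QID to
# (property, value) facts, e.g. P31 "instance of", P27 "citizenship".
def plausibility_examples(items, negatives_per_item=1, seed=0):
    """For each item, hold out one fact as a positive example and sample
    random facts from other items as (likely) implausible negatives."""
    rng = random.Random(seed)
    all_facts = [f for facts in items.values() for f in facts]
    examples = []
    for qid, facts in items.items():
        if len(facts) < 2:
            continue
        held_out = rng.choice(facts)
        context = [f for f in facts if f != held_out]
        examples.append((context, held_out, 1))         # plausible
        for _ in range(negatives_per_item):
            corrupt = rng.choice(all_facts)
            if corrupt not in facts:
                examples.append((context, corrupt, 0))  # implausible
    return examples

toy_kb = {
    "Q567": [("P31", "Q5"), ("P21", "Q6581072"), ("P27", "Q183")],
    "Q183": [("P31", "Q6256"), ("P36", "Q64")],
}
for context, fact, label in plausibility_examples(toy_kb):
    print(label, fact, "given", context)
# A scorer trained on such (context, fact, label) triples can then judge
# how plausible a new fact is, and later act as the reward function in
# the reinforcement-learning fine-tuning of the parser described above.
```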
320 00:19:35,887 --> 00:19:38,912 Our email addresses were on the first slide, 321 00:19:38,913 --> 00:19:42,582 so please do reach out if you have questions or are interested 322 00:19:42,582 --> 00:19:47,149 and I think we have time for a couple questions now in case... 323 00:19:49,450 --> 00:19:51,446 (applause) 324 00:19:51,447 --> 00:19:52,447 Thanks. 325 00:19:55,583 --> 00:19:59,400 (woman 1) Thank you for your presentation. I do have a concern however. 326 00:19:59,401 --> 00:20:05,441 The Wikipedia corpus is known to be biased. 327 00:20:05,442 --> 00:20:09,841 There's a very strong bias-- for example, fewer women, more men, 328 00:20:09,842 --> 00:20:11,787 all sorts of other aspects in there. 329 00:20:11,787 --> 00:20:15,201 So isn't this actually also tainting the knowledge 330 00:20:15,202 --> 00:20:19,471 that you are taking out of Wikipedia? 331 00:20:22,320 --> 00:20:25,424 Well, there are two aspects of the question. 332 00:20:25,425 --> 00:20:28,591 There's both in the model that we are then training, 333 00:20:28,591 --> 00:20:32,495 you could ask how... let's just... 334 00:20:33,172 --> 00:20:35,841 If you make it really simple and say like: 335 00:20:35,842 --> 00:20:41,204 Does it mean that the model will then be worse 336 00:20:41,204 --> 00:20:46,027 at predicting facts about women than men, say, 337 00:20:46,027 --> 00:20:50,416 or some other set of groups? 338 00:20:53,098 --> 00:20:55,424 To begin with, if you just look at the raw data, 339 00:20:55,425 --> 00:21:00,529 it will reflect whatever is the bias in the training data, so that's... 340 00:21:02,810 --> 00:21:06,001 People work on this to try and address that in the best possible way. 341 00:21:06,002 --> 00:21:10,068 But normally, when you're training a model, 342 00:21:10,069 --> 00:21:14,244 it will reflect whatever data you're training it on. 343 00:21:14,870 --> 00:21:18,980 So that's something to account for when doing the work, yeah. 344 00:21:21,498 --> 00:21:23,194 (man 2) Hi, this is [Marco]. 345 00:21:23,195 --> 00:21:25,960 I am a natural language processing practitioner. 346 00:21:26,853 --> 00:21:31,578 I was curious about how you model your facts. 347 00:21:31,578 --> 00:21:34,535 So I heard you say frame semantics, 348 00:21:34,535 --> 00:21:35,557 Right. 349 00:21:35,557 --> 00:21:38,875 (Marco) Could you maybe give some more details on that, please? 350 00:21:40,053 --> 00:21:46,510 Yes, so it's frame semantics, we're using frame semantics, 351 00:21:46,510 --> 00:21:49,642 and basically, 352 00:21:49,642 --> 00:21:55,778 all of the facts in Wikidata, they're modeled as frames. 353 00:21:56,291 --> 00:21:58,801 And so that's an essential part of the setup 354 00:21:58,811 --> 00:22:00,027 and how we make this work. 355 00:22:00,028 --> 00:22:03,770 That's essentially how we try to address the... 356 00:22:03,771 --> 00:22:06,680 How can I make all the knowledge that I have in Wikidata 357 00:22:06,680 --> 00:22:11,012 available in a context where I can annotate and train my model 358 00:22:12,485 --> 00:22:14,441 when I am annotating or parsing text? 359 00:22:14,442 --> 00:22:19,806 The answer is that the existing data in Wikidata is modeled as frames. 360 00:22:19,806 --> 00:22:21,007 So the store that we have, 361 00:22:21,008 --> 00:22:24,041 the knowledge base with all of the knowledge is a frame store, 362 00:22:24,042 --> 00:22:27,251 and this is the same frame store that we are building on top of 363 00:22:27,251 --> 00:22:29,521 when we're then parsing the text.
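As a rough picture of what "modeled as frames" means here, using the Chancellor of Germany example from earlier: this is a schematic Python sketch, not SLING's actual frame syntax or API, and the placeholder id is an assumption.

```python
# Schematic illustration only -- not SLING's frame notation or API.
# A frame is a bundle of slots; slot values can be other frames, so
# items, statements and text mentions all live in one shared store.
merkel = {"id": "Q567", "name": "Angela Merkel"}
germany = {"id": "Q183", "name": "Germany"}

chancellor_of_germany = {
    "id": "Q_CHANCELLOR_DE",           # placeholder id for illustration
    "name": "Chancellor of Germany",
    "P279": {"name": "Chancellor"},    # subclass of
    "P17": germany,                    # country
    "P1308": merkel,                   # officeholder
}

# A mention in a parsed document is just another frame pointing back
# into the same store, which links text spans to world knowledge.
mention = {"text": "Chancellor of Germany", "evokes": chancellor_of_germany}
print(mention["evokes"]["P1308"]["name"])  # -> Angela Merkel
```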
364 00:22:29,522 --> 00:22:34,024 (Marco) So you're converting the Wikidata data model into some frame. 365 00:22:34,551 --> 00:22:36,703 Yes, we are converting the Wikidata model 366 00:22:36,704 --> 00:22:39,871 into one large frame store if you want, yeah. 367 00:22:40,558 --> 00:22:43,605 (man 3) Thanks. Is Pluto a planet? 368 00:22:44,394 --> 00:22:47,226 (audience laughing) 369 00:22:47,227 --> 00:22:48,227 Can I get the question... 370 00:22:48,228 --> 00:22:51,561 (man 3) I like the bootstrapping thing that you are doing, 371 00:22:51,562 --> 00:22:53,402 I mean the way that you're training your model 372 00:22:53,403 --> 00:22:57,726 by picking out the known facts about things that are verified, 373 00:22:57,727 --> 00:23:00,666 and then training the plausibility prediction 374 00:23:00,667 --> 00:23:03,681 by trying to teach the architecture of the system 375 00:23:03,682 --> 00:23:06,481 to recognize that actually, that fact fits. 376 00:23:06,482 --> 00:23:13,464 So that will work for large classes, but it will really... 377 00:23:13,464 --> 00:23:15,744 It doesn't sound like it will learn about surprises 378 00:23:15,745 --> 00:23:18,677 and especially not in small classes of items, right. 379 00:23:18,677 --> 00:23:20,841 So if you train your model in... 380 00:23:20,842 --> 00:23:23,481 When did Pluto disappear, I forgot... 381 00:23:23,482 --> 00:23:24,482 As a planet, you mean. 382 00:23:24,483 --> 00:23:26,900 (man 3) Yeah, it used to be a member of the solar system 383 00:23:26,900 --> 00:23:29,437 and we had how many, nine observations there. 384 00:23:29,437 --> 00:23:31,167 - Yeah. - (man 3) It's slightly problematic. 385 00:23:31,168 --> 00:23:33,514 So everyone, the kids think that Pluto is not a planet, 386 00:23:33,515 --> 00:23:36,039 I still think it's a planet, but never mind. 387 00:23:36,040 --> 00:23:42,320 So the fact that it suddenly stopped being a planet, 388 00:23:42,321 --> 00:23:45,521 which was supported in the period before, I don't know, hundreds of years, right? 389 00:23:47,150 --> 00:23:50,161 That's crazy, how would you go for figuring out that thing? 390 00:23:50,162 --> 00:23:53,595 For example, the new claim is not plausible for that thing. 391 00:23:53,595 --> 00:23:55,886 Sure. So there are two things. 392 00:23:55,887 --> 00:23:59,430 So there's both like how precise is a plausibility model. 393 00:23:59,431 --> 00:24:02,086 So what it distinguishes between is random facts 394 00:24:02,087 --> 00:24:03,600 and facts that are plausible. 395 00:24:04,105 --> 00:24:06,600 And there's also the question of whether Pluto is a planet 396 00:24:06,601 --> 00:24:09,241 and that's back to whether... 397 00:24:09,242 --> 00:24:10,339 I was in another session 398 00:24:10,340 --> 00:24:14,060 where someone brought up the example of the earth being flat, 399 00:24:14,060 --> 00:24:16,547 - whether that is a fact or not. - (man 3) That makes sense. 400 00:24:16,548 --> 00:24:18,508 So it is a fact in a sense that you can put it in, 401 00:24:18,509 --> 00:24:19,950 I guess you could put it in Wikidata 402 00:24:19,951 --> 00:24:22,031 with sources that are claiming that that's the thing. 
403 00:24:22,032 --> 00:24:26,561 So again, you would not necessarily want to train the model in a way 404 00:24:26,562 --> 00:24:30,721 where if you read someone saying the planet Pluto, bla, bla, bla, 405 00:24:30,722 --> 00:24:33,561 then it should be fine for it 406 00:24:33,562 --> 00:24:36,561 to then say that an annotation for this text 407 00:24:36,562 --> 00:24:38,200 is that Pluto is a planet. 408 00:24:39,509 --> 00:24:41,432 That doesn't mean, you know... 409 00:24:42,120 --> 00:24:46,918 The model won't be able to tell what "in the end" is the truth, 410 00:24:46,919 --> 00:24:49,214 I don't think any of us here will be able to either, so... 411 00:24:49,214 --> 00:24:50,285 (man 3) I just want to say 412 00:24:50,285 --> 00:24:52,775 it's not a hard accusation against the approach 413 00:24:52,776 --> 00:24:56,028 because even people cannot be sure whether that's a fact, 414 00:24:56,029 --> 00:24:58,214 a new fact is plausible at that moment. 415 00:24:58,730 --> 00:24:59,730 But that's always... 416 00:24:59,731 --> 00:25:03,386 I just maybe reiterated a question that I am posing all the time 417 00:25:03,387 --> 00:25:05,750 to myself and my work; I always ask. 418 00:25:06,311 --> 00:25:09,267 We do the statistical learning thing, it's amazing nowadays 419 00:25:09,268 --> 00:25:13,585 we can do billions of things, but we cannot learn about surprises, 420 00:25:13,586 --> 00:25:16,840 and they are very, very important in fact, right? 421 00:25:17,595 --> 00:25:20,711 - (man 4) But, just to refute... - (man 3) Thank you. 422 00:25:22,567 --> 00:25:26,551 (man 4) The plausibility model is combined with kind of two extra rules. 423 00:25:26,551 --> 00:25:30,361 First of all, if it's in Wikidata, it's true. 424 00:25:30,362 --> 00:25:34,635 We just give you the benefit of the doubt, so please make it good. 425 00:25:34,636 --> 00:25:39,261 The second thing is if it's not allowed by the schema it's false; 426 00:25:39,770 --> 00:25:42,504 it's all the things in between we're looking at. 427 00:25:43,436 --> 00:25:50,366 So if it's a planet according to Wikidata, it will be a true fact. 428 00:25:53,130 --> 00:25:57,406 But it won't predict surprises, but what is important here 429 00:25:57,407 --> 00:26:01,814 is that there's kind of no manual human work involved, 430 00:26:01,814 --> 00:26:03,629 so there's nothing that prevents you from... 431 00:26:03,629 --> 00:26:05,936 Well, now, if we're successful with the approach, 432 00:26:05,937 --> 00:26:09,019 there's nothing that prevents him from continuously updating the model 433 00:26:09,019 --> 00:26:12,483 with changes happening in Wikidata and Wikipedia and so on. 434 00:26:12,484 --> 00:26:18,128 So in theory, you should be able to quickly learn new surprises. 435 00:26:18,129 --> 00:26:19,657 (moderator) One last question. 436 00:26:20,223 --> 00:26:23,157 - (man 4) Maybe we're biased by Wikidata. - Yeah. 437 00:26:23,683 --> 00:26:27,561 (man 4) You are our bias. Whatever you annotate is what we believe. 438 00:26:27,562 --> 00:26:31,701 So if you make it good, if you make it balanced, 439 00:26:31,702 --> 00:26:33,953 we can hopefully be balanced. 440 00:26:33,954 --> 00:26:39,365 With the gender thing, there's actually an interesting thing.
441 00:26:39,951 --> 00:26:42,299 We are actually getting more training facts 442 00:26:42,300 --> 00:26:43,649 about women than men 443 00:26:43,650 --> 00:26:48,954 because "she" is a much less ambiguous pronoun in the text, 444 00:26:48,954 --> 00:26:51,600 so we actually get a lot more true facts about women. 445 00:26:51,600 --> 00:26:55,189 So we are biased, but on the women's side. 446 00:26:56,241 --> 00:26:58,924 (woman 2) No, I want to see the data on that. 447 00:26:58,925 --> 00:27:00,471 (audience laughing) 448 00:27:00,471 --> 00:27:02,381 We should bring that along next time. 449 00:27:02,381 --> 00:27:04,945 (man 4) You get hard decisions [inaudible]. 450 00:27:04,945 --> 00:27:06,285 (man 3) Yes, hard decision. 451 00:27:07,885 --> 00:27:13,001 (man 5) It says SLING is... parser across many languages 452 00:27:13,002 --> 00:27:15,163 - and you showed us English. - Yes! 453 00:27:15,163 --> 00:27:17,934 (man 5) Can you say something about the number of languages that you are-- 454 00:27:17,934 --> 00:27:19,155 Yes! Thank you for asking. 455 00:27:19,155 --> 00:27:21,602 I had told myself to say that up front on the first page 456 00:27:21,602 --> 00:27:23,363 because otherwise, I would forget, and I did. 457 00:27:24,742 --> 00:27:25,742 So right now, 458 00:27:25,743 --> 00:27:29,876 we're not actually looking at two files, we're looking at 13 files. 459 00:27:29,877 --> 00:27:32,768 So Wikipedia dumps from 12 different languages 460 00:27:32,769 --> 00:27:35,801 that we're processing, 461 00:27:35,802 --> 00:27:41,483 and none of this is dependent on the language being English. 462 00:27:41,484 --> 00:27:44,280 So we're processing this for all of the 12 languages. 463 00:27:48,238 --> 00:27:49,238 Yeah. 464 00:27:49,239 --> 00:27:50,239 For now, 465 00:27:50,240 --> 00:27:56,617 they share the property of, I think, using the Latin alphabet, and so on. 466 00:27:56,617 --> 00:27:58,601 Mostly for us to be able to make sure 467 00:27:58,602 --> 00:28:02,121 that what we are doing still makes sense and works. 468 00:28:02,121 --> 00:28:04,961 But there's nothing fundamental about the approach 469 00:28:04,962 --> 00:28:09,869 that prevents it from being used in very different languages 470 00:28:09,870 --> 00:28:14,656 from those being spoken around this area. 471 00:28:17,275 --> 00:28:19,321 (woman 3) Leila from Wikimedia Foundation. 472 00:28:19,322 --> 00:28:21,850 I may have missed this when you presented this. 473 00:28:22,904 --> 00:28:28,385 Do you make an attempt to bring any references from Wikipedia articles 474 00:28:28,386 --> 00:28:32,433 back to the properties and statements you're making in Wikidata? 475 00:28:33,357 --> 00:28:37,222 So I briefly mentioned this as a potential application. 476 00:28:37,222 --> 00:28:40,352 So for now, what we're trying to do is just to get this to work, 477 00:28:41,156 --> 00:28:46,005 but let's say we did get it to work with a high level of quality, 478 00:28:46,622 --> 00:28:51,240 that would be an obvious thing to try to do, so when you... 479 00:28:52,811 --> 00:28:55,187 Let's say you were willing to... 480 00:28:55,187 --> 00:28:59,590 I know there's some controversy around using Wikipedia as a source for Wikidata, 481 00:28:59,590 --> 00:29:01,957 that you can't have circular references and so on, 482 00:29:01,957 --> 00:29:04,849 so you need to have properly sourced facts.
483 00:29:04,850 --> 00:29:07,420 So let's say you were coming up with new facts, 484 00:29:07,421 --> 00:29:14,307 and obviously, you could look at the coverage of news media and so on 485 00:29:14,308 --> 00:29:16,220 and process these and try to annotate these. 486 00:29:16,221 --> 00:29:19,522 And then, that way, find sources for facts, 487 00:29:19,523 --> 00:29:20,964 new facts that you come up with. 488 00:29:20,965 --> 00:29:22,326 Or you could even take existing... 489 00:29:22,327 --> 00:29:25,901 There are a lot of facts in Wikidata that either have no sources 490 00:29:25,901 --> 00:29:29,641 or only have Wikipedia as a source, so you can start processing these 491 00:29:29,642 --> 00:29:32,802 and try to find sources for those automatically. 492 00:29:33,545 --> 00:29:38,198 (Leila) Or even within the articles that you're taking this information from 493 00:29:38,199 --> 00:29:41,879 just using the sources from there because they may contain... 494 00:29:42,383 --> 00:29:44,329 - Yeah. Yeah. - Yeah. Thanks. 495 00:29:47,428 --> 00:29:49,315 - (moderator) Thanks, Anders. - Cool. Thanks. 496 00:29:49,919 --> 00:29:55,345 (applause)