1 00:00:06,055 --> 00:00:09,281 (moderator) Good afternoon, everybody. We're about to start. 2 00:00:09,281 --> 00:00:11,416 I'm presenting you John Samuel 3 00:00:11,416 --> 00:00:17,207 who works at the French engineering school CPE, 4 00:00:17,207 --> 00:00:19,658 based in Lyon in France. 5 00:00:19,658 --> 00:00:21,101 And he will tell us something more 6 00:00:21,101 --> 00:00:27,271 about the translation of properties in Wikidata. 7 00:00:27,271 --> 00:00:29,604 As you know, as is the case in all sessions, 8 00:00:29,604 --> 00:00:32,172 there is an etherpad for collaborative note-taking. 9 00:00:32,172 --> 00:00:34,904 Please don't forget that. 10 00:00:34,904 --> 00:00:36,302 We'll have the presentation 11 00:00:36,302 --> 00:00:39,988 and then we'll have some time for a short Q&A. 12 00:00:39,988 --> 00:00:42,051 - The floor is yours. - (John) Thanks, [inaudible]. 13 00:00:42,917 --> 00:00:45,114 Thank you all for coming here. 14 00:00:45,114 --> 00:00:50,257 So my talk is about analyzing translation of Wikidata properties. 15 00:00:50,257 --> 00:00:52,743 So just give you a quick outline. 16 00:00:52,743 --> 00:00:54,859 I would like to introduce this topic. 17 00:00:54,859 --> 00:00:58,756 I will present a tool that I developed some years before, 18 00:00:58,756 --> 00:01:01,446 called WDProp, which I'm continuously working, 19 00:01:01,446 --> 00:01:03,795 and based on the feedback from the community, 20 00:01:03,795 --> 00:01:05,319 I add new features. 21 00:01:05,319 --> 00:01:09,368 And then I will talk about something called coarser analysis, 22 00:01:09,368 --> 00:01:12,476 where I would like to look at the property translation, 23 00:01:12,476 --> 00:01:15,257 from a much larger picture. 24 00:01:15,257 --> 00:01:18,667 So I will talk about how we collected this data, 25 00:01:18,667 --> 00:01:23,002 because this work is also done with one of my students, Thibaut Chamard. 
26 00:01:23,002 --> 00:01:26,682 And then I will present some results, and finally, I will conclude the talk. 27 00:01:27,469 --> 00:01:30,982 So Wikidata, as you all know, it started in 2012, 28 00:01:30,982 --> 00:01:33,877 and it's a free, open, linked, structured, collaborative, 29 00:01:33,877 --> 00:01:36,010 and multilingual knowledge base. 30 00:01:36,910 --> 00:01:40,063 My focus today is on the multilingual part, 31 00:01:40,063 --> 00:01:42,979 because there is a big change from the traditional way 32 00:01:42,979 --> 00:01:45,412 of how we used to edit on Wikipedia site. 33 00:01:45,412 --> 00:01:47,917 There were multiple subdomains, 34 00:01:47,917 --> 00:01:50,753 and now you'll have a single domain on a Wikidata 35 00:01:50,753 --> 00:01:56,191 where multilingual contributors come and write or create articles. 36 00:01:56,191 --> 00:01:57,499 So this is a collaborative. 37 00:01:57,499 --> 00:02:00,585 There has been work to say what exactly is collaborative, 38 00:02:00,585 --> 00:02:02,441 why it is collaborative. 39 00:02:02,441 --> 00:02:04,597 I have given references for these works. 40 00:02:04,597 --> 00:02:07,254 So this is, if you see Wikidata, 41 00:02:07,254 --> 00:02:11,057 everything that starts is starting from the property. 42 00:02:11,057 --> 00:02:14,144 The property is proposed and then discussed and voted. 43 00:02:14,144 --> 00:02:17,471 And then it is created and finally translated, 44 00:02:17,471 --> 00:02:20,005 and then you are finally able to use these properties. 45 00:02:20,005 --> 00:02:22,010 But these properties may also be deleted-- 46 00:02:22,010 --> 00:02:24,019 there's also something called deletion. 47 00:02:24,019 --> 00:02:26,700 But, as I highlighted on this slide, 48 00:02:26,700 --> 00:02:28,856 my focus is on the multilingual aspect, 49 00:02:28,856 --> 00:02:32,671 and the property creation and translation point of view. 
50 00:02:32,671 --> 00:02:36,408 So you have been here for the past two days, 51 00:02:36,408 --> 00:02:40,095 and by this time you have seen many articles, 52 00:02:40,095 --> 00:02:46,029 and I just want to point what am I looking for on a Wikidata item. 53 00:02:46,029 --> 00:02:48,005 This is a Wikidata item, 54 00:02:48,005 --> 00:02:51,697 so you have this Q2841, which is Bogotá, 55 00:02:51,697 --> 00:02:55,597 which is the capital city of Colombia, 56 00:02:55,597 --> 00:02:57,389 and you have four parts here: 57 00:02:57,389 --> 00:03:00,678 the languages, the labels, the description, and aliases. 58 00:03:00,678 --> 00:03:02,255 So you can see, for different languages 59 00:03:02,255 --> 00:03:05,089 you'll have the label, you have the description 60 00:03:05,089 --> 00:03:10,970 as well as if there any aliases also known as, you could see them. 61 00:03:10,970 --> 00:03:14,180 And this, under the city, where you see the labels 62 00:03:14,180 --> 00:03:16,155 and the properties together. 63 00:03:16,155 --> 00:03:20,845 This is Avignon, a city in France. 64 00:03:20,845 --> 00:03:24,966 So what I'm interested in is only the properties part. 65 00:03:24,966 --> 00:03:30,638 For example, official name, native label, country, capital of, et cetera. 66 00:03:30,638 --> 00:03:34,310 So when I say property, for example, if a country, 67 00:03:34,310 --> 00:03:37,736 in this country, I'm looking at different aspects: 68 00:03:37,736 --> 00:03:39,986 the language, the label, and the description, 69 00:03:39,986 --> 00:03:42,670 and see how things change. 
70 00:03:42,670 --> 00:03:44,446 For example, if you take *instance of*-- 71 00:03:44,446 --> 00:03:48,932 okay, everybody knows instance of, you have been using it quite a lot-- 72 00:03:48,932 --> 00:03:54,089 this is P31, you see the number of aliases in English 73 00:03:54,089 --> 00:03:58,667 for the property P31 in instance of, 74 00:03:58,667 --> 00:04:03,686 and then you would find that these types of properties 75 00:04:03,686 --> 00:04:07,536 are created after discussion with the community. 76 00:04:07,536 --> 00:04:10,513 So if I take the complete prop-- the procedure, 77 00:04:10,513 --> 00:04:13,343 what happens to creation of properties-- 78 00:04:13,343 --> 00:04:17,347 you start proposing properties with some possible translation. 79 00:04:17,347 --> 00:04:19,388 It is important it's not just in English. 80 00:04:19,388 --> 00:04:23,734 You have the templates to suggest your properties 81 00:04:23,734 --> 00:04:25,129 in your local language. 82 00:04:25,129 --> 00:04:28,552 So that's why it's a proposition with possible translation. 83 00:04:28,552 --> 00:04:32,367 And then you put it to discussion, then you are put to voting, 84 00:04:32,367 --> 00:04:37,273 and it's created, and then finally, the community members start translating it 85 00:04:37,273 --> 00:04:38,976 and people put it into use. 86 00:04:38,976 --> 00:04:42,336 But then you cannot be guaranteed the properties that are created 87 00:04:42,336 --> 00:04:44,435 are always there forever. 88 00:04:44,435 --> 00:04:47,417 Properties can be deleted, just like items can be deleted. 89 00:04:47,417 --> 00:04:51,004 But then, again, it goes through a similar procedure. 
90 00:04:51,004 --> 00:04:54,727 You put the property 91 00:04:54,727 --> 00:04:58,427 as you propose that it should be deleted, 92 00:04:58,427 --> 00:05:02,424 and if the community decides it, it votes it, and then if it is decided-- 93 00:05:02,424 --> 00:05:05,238 the majority votes has decided to delete it-- 94 00:05:05,238 --> 00:05:09,191 we deprecate the property, and finally we delete this property. 95 00:05:09,191 --> 00:05:14,826 So for today's talk, I'm mostly interested for the translation part. 96 00:05:14,826 --> 00:05:17,004 So where are the translations happening? 97 00:05:17,004 --> 00:05:20,037 First, the translation would happen at the proposition part, 98 00:05:20,037 --> 00:05:22,778 and then you could find that, at the time of creation, 99 00:05:22,778 --> 00:05:27,917 the person who creates the property can use the exact names 100 00:05:27,917 --> 00:05:31,062 that were suggested by the property proposer 101 00:05:31,062 --> 00:05:34,753 and he or she will create the properties, 102 00:05:34,753 --> 00:05:38,705 and later, you start translating these properties. 103 00:05:38,705 --> 00:05:43,176 So let us look at why this matters, why it is important. 104 00:05:43,176 --> 00:05:44,909 So I put some examples. 105 00:05:44,909 --> 00:05:47,162 This is, again, on P31, 106 00:05:47,162 --> 00:05:51,762 instance of the very, very famous property P31, 107 00:05:51,762 --> 00:05:56,094 and you see there is no description for this item. 108 00:05:56,094 --> 00:06:00,876 There are almost six descriptions on this image, 109 00:06:00,876 --> 00:06:03,310 where we do not have any description. 110 00:06:03,310 --> 00:06:06,961 Again, some more description for Odia and Punjabi, 111 00:06:06,961 --> 00:06:07,970 there is no description. 112 00:06:07,970 --> 00:06:10,806 This is a property which is used quite a lot, 113 00:06:10,806 --> 00:06:13,820 and you see that there is no description for it. 
114 00:06:13,820 --> 00:06:17,876 And there is a surprising part that you could also have cases 115 00:06:17,876 --> 00:06:22,000 where there are descriptions, but there are no labels. 116 00:06:22,000 --> 00:06:25,293 For example, Russian, that has been shown here, 117 00:06:25,293 --> 00:06:30,116 again on property P31, there is a label that is missing. 118 00:06:30,116 --> 00:06:34,100 So this was the initial inspiration for this work 119 00:06:34,100 --> 00:06:37,486 when I started working on property analysis. 120 00:06:37,486 --> 00:06:44,272 I wanted to look at what aspects of properties, 121 00:06:44,272 --> 00:06:46,459 or what aspects of property 122 00:06:46,459 --> 00:06:49,569 that the whole flow chart that we have seen, 123 00:06:49,569 --> 00:06:51,316 is multilingual. 124 00:06:51,316 --> 00:06:53,048 So I wanted to look at, 125 00:06:53,048 --> 00:06:56,304 okay, we know that Wikidata is multilingual, 126 00:06:56,304 --> 00:06:58,984 and it's collaborative, that has been done. 127 00:06:58,984 --> 00:07:05,285 But are we really able to achieve a truly multilingual experience? 128 00:07:05,285 --> 00:07:09,054 That was the question behind the creation of WDProp. 129 00:07:09,054 --> 00:07:11,166 So you may ask why there are so many people 130 00:07:11,166 --> 00:07:14,600 who have worked on items, there are people who have worked on-- 131 00:07:14,600 --> 00:07:17,047 users, multilingual users and bots, et cetera, 132 00:07:17,047 --> 00:07:19,444 why you want to focus on properties? 133 00:07:19,444 --> 00:07:22,770 The answer is, I want to focus on properties 134 00:07:22,770 --> 00:07:25,738 because it's much less influenced by bots. 
135 00:07:25,738 --> 00:07:28,581 You may have heard today or yesterday, 136 00:07:28,581 --> 00:07:31,895 many people said, "Okay, if you have translation 137 00:07:31,895 --> 00:07:36,761 in your local languages, and it has reached a very good number, 138 00:07:36,761 --> 00:07:39,227 you should ensure what type of translation it is. 139 00:07:39,227 --> 00:07:44,339 Is it just bots, which copy the name of a person to another language? 140 00:07:44,339 --> 00:07:47,242 Then is it really translation?" 141 00:07:47,242 --> 00:07:48,413 Okay, that's debatable. 142 00:07:48,413 --> 00:07:51,365 But, of course, there is an influence by bot, 143 00:07:51,365 --> 00:07:54,811 but in case of properties, there is not so much influence by bots, 144 00:07:54,811 --> 00:07:55,913 and that is a good part. 145 00:07:55,913 --> 00:08:00,706 That's why I focus on the properties part. 146 00:08:00,706 --> 00:08:05,552 So, as I said, when WDProp was created, 147 00:08:05,552 --> 00:08:09,451 it was to understand every aspect-- the proposal, the creation, translation. 148 00:08:09,451 --> 00:08:12,326 What are the templates that are available. 149 00:08:12,326 --> 00:08:16,232 Are these templates, for example, you said support, 150 00:08:16,232 --> 00:08:21,875 if a French person opens Wikidata, a Wikidata France translation page, 151 00:08:21,875 --> 00:08:28,039 can he see the word, [*soutien*], for that particular property proposal? 152 00:08:28,039 --> 00:08:29,373 Is it possible? 153 00:08:29,373 --> 00:08:33,125 So this type of things was needed. 154 00:08:33,125 --> 00:08:35,987 In the end, it was also about giving real-time statistics 155 00:08:35,987 --> 00:08:37,741 to the multilingual contributors. 156 00:08:37,741 --> 00:08:38,783 It's not about one time, 157 00:08:38,783 --> 00:08:42,178 it's like you just made it and published for one time-- no. 158 00:08:42,178 --> 00:08:45,434 You want people to get this data in real time. 
159 00:08:45,434 --> 00:08:46,716 So what are we doing? 160 00:08:46,716 --> 00:08:52,065 So the goal of WDProp was to understand everything 161 00:08:52,065 --> 00:08:54,418 about Wikidata properties. 162 00:08:54,418 --> 00:08:56,955 So, label, aliases, description. 163 00:08:56,955 --> 00:09:01,348 So you have got all these three translated so the middle part where you say, 164 00:09:01,348 --> 00:09:05,618 this property is completely usable because all the three aspects 165 00:09:05,618 --> 00:09:08,984 have been translated. 166 00:09:08,984 --> 00:09:12,055 So let me just show you quickly, what is this WDProp, 167 00:09:12,055 --> 00:09:14,224 what I'm talking about. 168 00:09:14,224 --> 00:09:15,496 So this is the WDProp, 169 00:09:15,496 --> 00:09:19,726 it's available on *tools.wmflabs.org/wdprop/.* 170 00:09:19,726 --> 00:09:23,813 So you have a lot of statistics and if I ask you some questions today, 171 00:09:23,813 --> 00:09:27,960 like, for example, "How many data types are there 172 00:09:27,960 --> 00:09:30,846 that are supported by Wikidata right now?" 173 00:09:30,846 --> 00:09:34,369 So if such questions, we do not know, 174 00:09:34,369 --> 00:09:37,549 sometimes because there are new data types that keep on coming. 175 00:09:37,549 --> 00:09:41,668 So this data, this is generated in real time, 176 00:09:41,668 --> 00:09:44,993 this creates the data structure and it will give you the answer. 177 00:09:44,993 --> 00:09:46,486 How many languages are there? 178 00:09:46,486 --> 00:09:50,194 Yes, of course, see that there are 313 languages. 179 00:09:50,194 --> 00:09:55,092 And then, for example, how many labels were translated. 180 00:09:55,092 --> 00:09:58,694 So you could see that the data is being fetched. 181 00:09:58,694 --> 00:10:00,242 I hope it comes. 182 00:10:01,512 --> 00:10:03,003 Okay, let's hope. (chuckles) 183 00:10:07,984 --> 00:10:11,621 Okay, I will take some other stuff as well. 
184 00:10:11,621 --> 00:10:13,964 Browsing all properties by their time. 185 00:10:13,964 --> 00:10:17,079 Yes. So you see, this is count of translated labels, 186 00:10:17,079 --> 00:10:20,142 and you see all this data that is coming real time, 187 00:10:20,142 --> 00:10:21,781 and you can see that the labels 188 00:10:21,781 --> 00:10:26,881 are currently available in 6,804 languages in English, 189 00:10:26,881 --> 00:10:31,291 followed by Dutch, followed by Arabic, followed by Ukrainian, and then French. 190 00:10:31,291 --> 00:10:32,922 So this is real-time statistics. 191 00:10:32,922 --> 00:10:35,446 So you could also do the same for description, 192 00:10:35,446 --> 00:10:37,747 also do for aliases, et cetera. 193 00:10:37,747 --> 00:10:41,383 And you could get the overall translation statuses if you want. 194 00:10:41,383 --> 00:10:43,937 So there are some other things that we will discuss later, 195 00:10:43,937 --> 00:10:45,586 if time permits. 196 00:10:45,586 --> 00:10:50,132 But you could navigate all the different items 197 00:10:50,132 --> 00:10:52,367 on the left-hand side, 198 00:10:52,367 --> 00:10:54,127 and you could see there are a lot of things 199 00:10:54,127 --> 00:10:59,471 that could really help to see what things are happening in WDProp. 200 00:10:59,471 --> 00:11:03,591 So this is, for example, Wikidata properties, 201 00:11:03,591 --> 00:11:05,789 these are the properties that are currently available. 202 00:11:05,789 --> 00:11:10,039 But as I said some time back, properties could be deleted. 203 00:11:10,039 --> 00:11:13,121 And this, you see that these are the properties that were deleted, 204 00:11:13,121 --> 00:11:17,171 starting from P1, P2, P3, P4, P5, these have all been deleted, 205 00:11:17,171 --> 00:11:23,005 and you could get this thing just from the statistics board. 206 00:11:23,005 --> 00:11:24,947 And here, so same thing. 
207 00:11:24,947 --> 00:11:29,938 Then, the next thing that interested me was to understand the translation pattern. 208 00:11:29,938 --> 00:11:33,388 So, for example, sometimes we feel that some languages-- 209 00:11:33,388 --> 00:11:36,514 so English is created first, and followed by maybe Dutch, 210 00:11:36,514 --> 00:11:38,201 or maybe French, 211 00:11:38,201 --> 00:11:40,701 and maybe after French, it could be Arabic. 212 00:11:40,701 --> 00:11:43,627 So these things could be interesting to know. 213 00:11:43,627 --> 00:11:48,596 So for that, we started to look at the idea of translation path-- 214 00:11:48,596 --> 00:11:51,607 exactly how things are translated. 215 00:11:51,607 --> 00:11:56,542 So again, if you go to the property page, you could click on any property. 216 00:11:56,542 --> 00:11:57,662 Sorry. 217 00:11:59,375 --> 00:12:01,053 Maybe I can show. 218 00:12:03,527 --> 00:12:06,497 So you could click on any property and you could just say, 219 00:12:06,497 --> 00:12:07,794 "Give me the translation path." 220 00:12:07,794 --> 00:12:11,487 It takes some time, but it will start bringing the data, 221 00:12:11,487 --> 00:12:15,434 because it's real time, so you get the data coming from all this. 222 00:12:15,434 --> 00:12:16,595 So you get the date, 223 00:12:16,595 --> 00:12:22,244 you get what things have been changed, when was something deleted, et cetera. 224 00:12:22,244 --> 00:12:23,848 Why it is important? 225 00:12:24,948 --> 00:12:29,401 For example, you see this is something that happened in 2017, 226 00:12:29,401 --> 00:12:31,955 and the label has been removed. 227 00:12:31,955 --> 00:12:33,893 This is the official website. 
228 00:12:33,893 --> 00:12:38,944 So imagine you have removed the label from the official website-- 229 00:12:38,944 --> 00:12:39,978 sorry, this country-- 230 00:12:39,978 --> 00:12:43,357 so anybody who doesn't know P17, what it is, cannot even understand, 231 00:12:43,357 --> 00:12:45,971 because the label has been deleted by the person. 232 00:12:45,971 --> 00:12:47,915 So this type of vandalism exists. 233 00:12:47,915 --> 00:12:50,710 Another example where, completely, 234 00:12:50,710 --> 00:12:52,601 all the language labels have been deleted-- 235 00:12:52,601 --> 00:12:56,183 English, French, Spanish, German, everything has been deleted. 236 00:12:56,183 --> 00:12:58,329 There are no labels, there are no descriptions. 237 00:12:58,329 --> 00:13:01,033 So you could find these types of things from the translation path 238 00:13:01,033 --> 00:13:05,483 and just because of the color code, you could see what happened on what day, 239 00:13:05,483 --> 00:13:09,666 and you could check exactly, because it is also linked. 240 00:13:09,666 --> 00:13:14,261 If you click on any of this, you could also get a link to the revision, 241 00:13:14,261 --> 00:13:19,478 identify what exactly happened during that particular revision. 242 00:13:19,478 --> 00:13:21,309 So this is coming from revision history. 243 00:13:21,309 --> 00:13:25,311 So if you click on any of this, you get what exactly is happening 244 00:13:25,311 --> 00:13:28,567 in any particular revision. 245 00:13:28,567 --> 00:13:30,733 So how did we build it? 246 00:13:30,733 --> 00:13:31,923 Just if you come back, 247 00:13:31,923 --> 00:13:38,396 here, you see there is something called a comment on the right-hand side. 248 00:13:38,396 --> 00:13:42,602 You see there is something called added aliases, 249 00:13:42,602 --> 00:13:46,613 "added British English aliases," "changed Esperanto label," 250 00:13:46,613 --> 00:13:48,109 "added [io] label," et cetera. 
251 00:13:48,109 --> 00:13:50,710 So we made use of this information, 252 00:13:50,710 --> 00:13:53,209 for example, for label description and aliases, 253 00:13:53,209 --> 00:13:55,507 if you add something, you have some sort of comment 254 00:13:55,507 --> 00:13:58,216 which starts with *wbsetlabel-add.* 255 00:13:58,216 --> 00:14:01,635 Or if it is updated, you have *wbsetlabel-set.* 256 00:14:01,635 --> 00:14:04,487 And if you remove something, you see it is removed. 257 00:14:04,487 --> 00:14:06,795 And based on this type of information, 258 00:14:06,795 --> 00:14:11,167 we were able to build such a translation path. 259 00:14:11,167 --> 00:14:16,557 Okay, this is good, but what happened is that this type of information, 260 00:14:16,557 --> 00:14:19,366 this type of things, just using the comment, 261 00:14:19,366 --> 00:14:23,932 it is useful for building real-time tools, just like what I showed before, WDProp, 262 00:14:23,932 --> 00:14:30,886 but it is very difficult to detect when there are multiple changes. 263 00:14:30,886 --> 00:14:34,871 For example, if you have seen bots activity on Wikidata, 264 00:14:34,871 --> 00:14:39,550 some bots make multiple labels in one single edit. 265 00:14:39,550 --> 00:14:42,037 In that case, you cannot find what happened 266 00:14:42,037 --> 00:14:45,878 because you do not have *wbsetlabel,* that particular language. 267 00:14:45,878 --> 00:14:49,254 So you do not have a set of languages along with your comment. 268 00:14:49,254 --> 00:14:53,703 So these are some problems if you want to use this type of approach. 269 00:14:54,603 --> 00:14:58,245 So what we did, we decided to collect the data, 270 00:14:58,245 --> 00:15:01,316 and we decided to publicly make this data available. 271 00:15:02,516 --> 00:15:06,246 And what we did, we wanted to make use of content. 272 00:15:06,246 --> 00:15:08,579 So what we did, we started with every revision, 273 00:15:08,579 --> 00:15:12,096 and we took the content of each revision. 
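The comment-based approach described above, where edit summaries start with *wbsetlabel-add* or *wbsetlabel-set*, could be sketched roughly as follows. This is only an illustration, not the WDProp code: the exact summary layout (`/* wbsetlabel-add:1|fr */ ...`), the matching description and aliases prefixes, the `remove` suffix, and the function names `parse_summary` and `translation_path` are all assumptions made for the example.

```python
import re

# Assumed layout of an edit summary: "/* wbsetlabel-add:1|fr */ pays"
SUMMARY_RE = re.compile(
    r"/\*\s*wbset(?P<term>label|description|aliases)"
    r"-(?P<action>add|set|remove)"
    r":\d+\|(?P<lang>[a-z0-9-]+)\s*\*/"
)


def parse_summary(comment):
    """Return (term, action, language), or None if the comment does not match."""
    match = SUMMARY_RE.search(comment)
    if match is None:
        return None
    return match.group("term"), match.group("action"), match.group("lang")


def translation_path(revisions):
    """Build a list of (timestamp, term, action, language) tuples.

    `revisions` is an iterable of (timestamp, comment) pairs, oldest first.
    Revisions whose comment names no single language are skipped.
    """
    path = []
    for timestamp, comment in revisions:
        parsed = parse_summary(comment)
        if parsed is not None:
            path.append((timestamp, *parsed))
    return path
```

An edit that changes several languages at once carries no single language code in its comment, so it is skipped here. That is the limitation the talk points out for bot edits.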
274 00:15:12,096 --> 00:15:16,717 And we took the next revision, and we decided to find the difference 275 00:15:16,717 --> 00:15:19,885 between these two revisions, to find what exactly changes, 276 00:15:19,885 --> 00:15:21,822 which of the labels got changed. 277 00:15:21,822 --> 00:15:25,436 Because of that, we got much more interesting information, 278 00:15:25,436 --> 00:15:28,899 much more accurate information than the previous approach 279 00:15:28,899 --> 00:15:31,274 because it is very important for doing analysis. 280 00:15:31,274 --> 00:15:34,020 It is important that you make use of correct data. 281 00:15:34,020 --> 00:15:36,866 So you have four columns that were used here-- 282 00:15:36,866 --> 00:15:39,091 timestamp, property, language, type, et cetera. 283 00:15:39,091 --> 00:15:44,494 And you get this data in this format. It is publicly available. 284 00:15:44,494 --> 00:15:47,446 So what does this data give me? 285 00:15:47,446 --> 00:15:48,791 This data gives me information 286 00:15:48,791 --> 00:15:54,791 that currently almost 4,000 plus, 287 00:15:54,791 --> 00:15:57,291 4,500 properties 288 00:15:57,291 --> 00:15:59,917 have labels between 0 and 20. 289 00:15:59,917 --> 00:16:02,145 So there are a lot of properties 290 00:16:02,145 --> 00:16:07,107 who do not have more than 20 multilingual labels. 291 00:16:07,107 --> 00:16:10,888 And there are only 1,500 language properties 292 00:16:10,888 --> 00:16:12,857 that have been translated up to 40. 293 00:16:12,857 --> 00:16:18,699 And yesterday, if you were present during the talk of Lydia Pintscher, 294 00:16:18,699 --> 00:16:21,967 she talked about P18, so P18 is something here. 295 00:16:21,967 --> 00:16:25,332 So you can see there are only a couple of six or seven properties 296 00:16:25,332 --> 00:16:30,147 that are currently having all the-- 297 00:16:30,147 --> 00:16:35,092 P18 has 154 translations, just to give that idea. 
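The content-based approach described above, comparing each revision with the next one to see which labels changed, might look something like this minimal sketch. It assumes each revision has already been reduced to a plain `{language: label}` dictionary; the change names (`added`, `removed`, `updated`) and the function names are hypothetical, and the row layout only loosely follows the columns mentioned in the talk (timestamp, property, language, type).

```python
def diff_labels(old, new):
    """Compare two {language: label} dicts; return (language, change) pairs."""
    changes = []
    for lang in sorted(set(old) | set(new)):
        if lang not in old:
            changes.append((lang, "added"))
        elif lang not in new:
            changes.append((lang, "removed"))
        elif old[lang] != new[lang]:
            changes.append((lang, "updated"))
    return changes


def revision_rows(property_id, revisions):
    """Turn a revision history into (timestamp, property, language, type) rows.

    `revisions` is a list of (timestamp, labels) pairs, oldest first.
    """
    rows = []
    previous = {}  # before the first revision there are no labels
    for timestamp, labels in revisions:
        for lang, change in diff_labels(previous, labels):
            rows.append((timestamp, property_id, lang, change))
        previous = labels
    return rows
```

Because the comparison works on content rather than on the comment, an edit that touches several languages at once produces one row per language.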
298 00:16:35,092 --> 00:16:39,913 So there is one property which is having 154 multilingual labels. 299 00:16:39,913 --> 00:16:43,807 There are properties which have only one particular label. 300 00:16:43,807 --> 00:16:50,112 And the average number of labels is only 21, 301 00:16:50,112 --> 00:16:52,945 and the standard deviation is 20. 302 00:16:52,945 --> 00:16:55,967 Okay, what next we would like to say? 303 00:16:55,967 --> 00:16:59,970 So you have seen something similar in the real-time data. 304 00:16:59,970 --> 00:17:02,079 This is from the collected data. 305 00:17:02,079 --> 00:17:07,503 So this is what are the top languages that are coming up in the results. 306 00:17:07,503 --> 00:17:09,186 So these we have seen. 307 00:17:09,186 --> 00:17:13,314 But my next point is, are there combinations possible. 308 00:17:13,314 --> 00:17:16,522 For example, if there is French, there is Arabic. 309 00:17:16,522 --> 00:17:19,505 If there is Arabic, there is some other language. 310 00:17:19,505 --> 00:17:22,102 If there's French, there's Ukrainian, et cetera. 311 00:17:22,102 --> 00:17:26,093 Can we find such type of combinations in the translation data set? 312 00:17:26,093 --> 00:17:27,415 So, yes, it is possible. 313 00:17:27,415 --> 00:17:30,195 So if you see this count, this frequent itemsets-- 314 00:17:30,195 --> 00:17:32,134 so I've just shown seven of them-- 315 00:17:32,134 --> 00:17:35,315 you find that there are combinations that are possible. 316 00:17:36,901 --> 00:17:41,397 Okay, let us say, is there a possibility of having four labels, 317 00:17:41,397 --> 00:17:44,313 like if there is English, there's also possibility to find Dutch, 318 00:17:44,313 --> 00:17:45,794 Arabic, Ukrainian. 319 00:17:45,794 --> 00:17:48,041 If there is English, there's possibility to find Dutch, 320 00:17:48,041 --> 00:17:49,798 French, and Arabic, et cetera. 321 00:17:49,798 --> 00:17:52,763 You can also find a lot of combinations. 
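The idea of frequent itemsets mentioned above, language combinations that tend to appear together on the same property, can be illustrated with a brute-force count. This is a sketch under assumptions, not the method used for the results in the talk: the input shape (`{property: set of language codes}`), the `min_support` threshold, and the function name are made up for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(label_languages, size, min_support):
    """Count language combinations of a given size across properties.

    `label_languages` maps a property ID to the set of languages it has
    labels in. Returns {combination: count} for every combination found
    on at least `min_support` properties.
    """
    counts = Counter()
    for languages in label_languages.values():
        # Sort first so the same combination always yields the same tuple.
        for combo in combinations(sorted(languages), size):
            counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}
```

Enumerating every combination gets expensive for properties with many labels, so on real data a pruning algorithm such as Apriori would probably be a better fit.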
322 00:17:52,763 --> 00:17:53,907 Why it is important? 323 00:17:53,907 --> 00:17:57,432 Because it is important to know if, 324 00:17:57,432 --> 00:17:59,998 for example, if you have multilingual speakers 325 00:17:59,998 --> 00:18:03,664 who are contributors, who can speak multiple languages, 326 00:18:03,664 --> 00:18:07,402 if you're able to find any particular pattern 327 00:18:07,402 --> 00:18:12,556 that helps us to find that if you tell this person to translate, 328 00:18:12,556 --> 00:18:15,276 a new property is created to translate this label, 329 00:18:15,276 --> 00:18:19,213 because he already speaks multiple languages, 330 00:18:19,213 --> 00:18:21,669 we can suggest these things to the user. 331 00:18:21,669 --> 00:18:24,858 So let's just show you one example. 332 00:18:24,858 --> 00:18:27,257 This is a complete translation path 333 00:18:27,257 --> 00:18:29,774 that has been obtained from different languages. 334 00:18:29,774 --> 00:18:35,001 So here, what we have done is we selected two small minority languages, 335 00:18:35,001 --> 00:18:39,293 like Tagalog and Kapampangan, 336 00:18:39,293 --> 00:18:42,602 which are minority languages from the Philippines, 337 00:18:42,602 --> 00:18:46,156 and you see that there is a strong transfer 338 00:18:46,156 --> 00:18:49,645 between Tagalog and Kapampangan. 339 00:18:49,645 --> 00:18:51,784 So these types of things can be detected 340 00:18:51,784 --> 00:18:54,738 when you have such type of translation results. 341 00:18:54,738 --> 00:18:57,311 So that is another advantage. 342 00:18:57,311 --> 00:18:59,780 To conclude my work, I would like to say, 343 00:18:59,780 --> 00:19:05,128 this is important that we understand how properties are translated 344 00:19:05,128 --> 00:19:10,534 because if you want to extract data from Wikipedia, 345 00:19:10,534 --> 00:19:14,661 you need to know what are the words 346 00:19:14,661 --> 00:19:16,491 in the local languages that are being used. 
347 00:19:16,491 --> 00:19:20,208 What is "image" in French, what is "image" in Punjabi, 348 00:19:20,208 --> 00:19:22,539 what is "image" in Hindi, or any other language. 349 00:19:22,539 --> 00:19:25,890 So that is important for importing data. 350 00:19:25,890 --> 00:19:30,023 And tomorrow, of course, if you are able to fetch this data, 351 00:19:30,023 --> 00:19:35,193 to Wikidata, we could also use new projects like Wikidata Bridge, 352 00:19:35,193 --> 00:19:38,963 which we could use to fill other infoboxes, 353 00:19:38,963 --> 00:19:44,563 like multilingual Wikipedia articles, 354 00:19:44,563 --> 00:19:47,370 and this could be really helpful. 355 00:19:47,370 --> 00:19:51,238 So with that, I would like to thank you, and if you have questions, 356 00:19:51,238 --> 00:19:54,321 I would be happy to answer them. 357 00:19:55,131 --> 00:19:57,218 (moderator) Anybody with questions? 358 00:19:58,842 --> 00:20:01,854 (audience applause) 359 00:20:08,387 --> 00:20:09,479 Yes? 360 00:20:11,988 --> 00:20:15,746 (man) So what you're doing is mainly analyzing how this-- 361 00:20:15,746 --> 00:20:17,389 - (John) Yes. - (man) ...is all happening? 362 00:20:17,389 --> 00:20:21,418 Do you know if there are initiatives or if there are tools 363 00:20:21,418 --> 00:20:25,331 which can help make this easier, like translation of properties? 364 00:20:25,331 --> 00:20:28,321 Yes. Tools, like, for example, what to translate 365 00:20:28,321 --> 00:20:32,995 from Wikimedia Foundation, is helpful, but I have not seen-- 366 00:20:32,995 --> 00:20:35,522 This is not currently integrated with Wikidata. 367 00:20:35,522 --> 00:20:41,672 What to translate is only integrated with certain languages on Wikipedia, 368 00:20:41,672 --> 00:20:44,485 but not on Wikidata. 369 00:20:44,485 --> 00:20:46,460 But that could be really interesting. 
370 00:20:46,460 --> 00:20:50,165 Yes, thank you for bringing this up, because just imagine, 371 00:20:50,165 --> 00:20:54,490 if we know that a person has been labeling in multiple languages, 372 00:20:54,490 --> 00:20:56,842 and we also have this what to translate tool, 373 00:20:56,842 --> 00:21:00,007 and we have these statistics, we have this data 374 00:21:00,007 --> 00:21:04,657 coming from this type of property translation, 375 00:21:04,657 --> 00:21:09,423 it is easier to suggest to a person that new properties have been created, 376 00:21:09,423 --> 00:21:11,461 and then you could-- 377 00:21:11,461 --> 00:21:13,980 Right now it's not integrated to Wikidata. 378 00:21:15,674 --> 00:21:17,432 (moderator) Anybody else? 379 00:21:20,246 --> 00:21:23,315 (man 2) I have one question myself, that comes back to it, 380 00:21:23,315 --> 00:21:27,748 does anybody know of working lists on translating properties? 381 00:21:27,748 --> 00:21:28,769 Sorry? 382 00:21:28,769 --> 00:21:30,489 (man 2) Does anybody know of working lists 383 00:21:30,489 --> 00:21:31,695 about translating properties, 384 00:21:31,695 --> 00:21:37,751 like, I can imagine from your statistics, you could say, this is the top 100 385 00:21:37,751 --> 00:21:39,944 most widely used properties 386 00:21:39,944 --> 00:21:42,844 who lack translations in this and this language? 387 00:21:42,844 --> 00:21:47,494 No, there is, I think, there are ways by, 388 00:21:47,494 --> 00:21:51,112 for example, you could browse by data types, 389 00:21:51,112 --> 00:21:53,843 browse by property classes. 
390 00:21:53,843 --> 00:21:57,398 For example, here is something called property classes 391 00:21:57,398 --> 00:22:00,743 where people have created projects-- 392 00:22:00,743 --> 00:22:03,272 it's taking time-- so you have projects, 393 00:22:03,272 --> 00:22:08,597 and you could say, how would I describe, what are the, for example, 394 00:22:08,597 --> 00:22:11,978 what are the properties that I could describe for this, 395 00:22:11,978 --> 00:22:14,183 for describing IEEE standard version? 396 00:22:14,183 --> 00:22:16,846 You need edition number, you need edition translation, et cetera. 397 00:22:16,846 --> 00:22:22,890 So if you have a targeted thing, you could search for what type of classes. 398 00:22:22,890 --> 00:22:25,853 For example, if you're working in GLAM or histories, 399 00:22:25,853 --> 00:22:29,652 you could say, what is history-related any document are there? 400 00:22:29,652 --> 00:22:32,715 So you could say, historical, and you could find historical. 401 00:22:32,715 --> 00:22:36,247 Okay, this is a property class, go to this property class. 402 00:22:36,247 --> 00:22:37,855 And, sorry, where is it? 403 00:22:37,855 --> 00:22:40,437 So it is having something called "Merimee ID." 404 00:22:40,437 --> 00:22:44,467 So people have been trying to use property classes 405 00:22:44,467 --> 00:22:45,913 to link objects. 406 00:22:45,913 --> 00:22:49,577 That helps if you're working on a particular project, 407 00:22:49,577 --> 00:22:52,342 and you could find that property's related to that. 408 00:22:52,342 --> 00:22:58,246 (man 2) But your tool could quite easily make a list of, let's say, 409 00:22:58,246 --> 00:23:02,746 the top 100 most widely used properties 410 00:23:02,746 --> 00:23:07,488 who haven't got, I don't know, Punjabi label, let's say? 411 00:23:07,488 --> 00:23:10,284 - (John) For that, I will just-- - (man 2) Which could be interesting. 
412 00:23:10,284 --> 00:23:14,310 (John) Okay, tell me any language, for example, let us say, Netherlands, 413 00:23:14,310 --> 00:23:17,456 because it's performing very well. 414 00:23:17,456 --> 00:23:21,861 So I would say-- translated labels. 415 00:23:21,861 --> 00:23:24,011 So this is translate-- sorry. 416 00:23:30,491 --> 00:23:33,059 (mouse clicking) 417 00:23:36,747 --> 00:23:38,697 For example, Hindi. 418 00:23:38,697 --> 00:23:40,497 So here, what happens, 419 00:23:40,497 --> 00:23:44,335 here you just see any properties that need translation. 420 00:23:44,335 --> 00:23:47,473 So there are like 6,647 properties 421 00:23:47,473 --> 00:23:50,299 that need translation in a particular language. 422 00:23:50,299 --> 00:23:54,998 So you could click on any language that you want and get the data. 423 00:23:54,998 --> 00:23:58,778 And you could get the list of where people need support. 424 00:23:58,778 --> 00:24:03,345 So, this could be interesting to link with property usage, 425 00:24:03,345 --> 00:24:06,232 how many people, is it really top, is it under the top ten. 426 00:24:06,232 --> 00:24:08,871 So suggest those ten top hundred, in that language. 427 00:24:08,871 --> 00:24:11,282 That would be an interesting list. That's good. 428 00:24:11,852 --> 00:24:13,054 (man 3) Just what you asked, 429 00:24:13,054 --> 00:24:17,077 there is a list of top 100 most used properties on Wikidata. 430 00:24:17,077 --> 00:24:18,924 It's on Wikidata. 431 00:24:18,924 --> 00:24:21,432 So, yeah, it's there, 432 00:24:21,432 --> 00:24:25,942 under Wikidata Database Reports/ Top 100 Properties. 433 00:24:25,942 --> 00:24:31,083 So one thing could be that we could just link this and suggest it. 434 00:24:31,083 --> 00:24:33,349 (moderator) Could you maybe add the link to the etherpad, 435 00:24:33,349 --> 00:24:37,270 and then maybe, this information can come together. 436 00:24:37,270 --> 00:24:38,631 (John) Okay. 
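The working list discussed in this exchange, the most widely used properties that still lack a label in a given language, could be assembled along these lines. This is a hypothetical sketch: it assumes usage counts and per-property label languages are already available as dictionaries, and the function name and the numbers in the usage example are invented for illustration.

```python
def missing_label_worklist(usage, labels, language, top_n=100):
    """List the most used properties that have no label in `language`.

    `usage` maps a property ID to its number of uses; `labels` maps a
    property ID to the set of languages it already has labels in.
    """
    missing = [p for p in usage if language not in labels.get(p, set())]
    missing.sort(key=lambda p: usage[p], reverse=True)
    return missing[:top_n]


# Made-up numbers, only to show the shape of the input and output.
usage = {"P31": 900, "P17": 500, "P18": 300}
labels = {"P31": {"en", "pa"}, "P17": {"en"}, "P18": {"en"}}
worklist = missing_label_worklist(usage, labels, "pa", top_n=2)
```

Here `worklist` holds P17 and then P18, since P31 already has a Punjabi label in this made-up data.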
437 00:24:40,049 --> 00:24:42,007 (moderator) If there is no other questions, 438 00:24:42,007 --> 00:24:44,045 then we will conclude here. 439 00:24:44,045 --> 00:24:49,236 And we have two, three minutes break until we start with the next speaker. 440 00:24:49,236 --> 00:24:50,864 - Thanks. - (John) Thank you very much. 441 00:24:50,864 --> 00:24:53,041 (audience applause)