1 00:00:05,888 --> 00:00:09,312 Now, there are approximately 7,500 languages 2 00:00:09,312 --> 00:00:10,806 spoken on the planet today. 3 00:00:11,770 --> 00:00:13,808 Of those, it's estimated 4 00:00:13,808 --> 00:00:18,466 that about 70% are at risk of not surviving 5 00:00:18,466 --> 00:00:20,355 the end of the 21st century. 6 00:00:22,270 --> 00:00:24,266 Every time a language dies, 7 00:00:24,711 --> 00:00:26,622 it's severing a connection 8 00:00:26,622 --> 00:00:30,590 that has lasted for hundreds to thousands of years, 9 00:00:30,590 --> 00:00:34,816 to culture, to history, 10 00:00:35,320 --> 00:00:38,150 and to traditions, and to knowledge. 11 00:00:38,933 --> 00:00:42,250 The linguist Kenneth Hale once said 12 00:00:42,250 --> 00:00:44,183 that every time a language dies, 13 00:00:44,183 --> 00:00:46,794 it's like dropping an atom bomb on the Louvre. 14 00:00:49,377 --> 00:00:51,844 So the question is, 15 00:00:52,730 --> 00:00:54,800 why do languages die? 16 00:00:56,244 --> 00:01:00,155 Well, perhaps the simple answer might be 17 00:01:00,162 --> 00:01:03,051 that one could imagine authoritarian governments 18 00:01:03,051 --> 00:01:05,311 preventing people from speaking their native language, 19 00:01:05,844 --> 00:01:09,630 children being punished for speaking their language at school, 20 00:01:09,866 --> 00:01:12,911 or the government shutting down radio stations 21 00:01:12,923 --> 00:01:14,644 in the minority language. 22 00:01:15,044 --> 00:01:16,977 And this definitely happened in the past, 23 00:01:16,977 --> 00:01:19,088 and it still, to some extent, happens today. 24 00:01:19,616 --> 00:01:23,026 But the honest answer 25 00:01:23,026 --> 00:01:26,666 is that for the vast majority of the cases of language extinction, 26 00:01:27,296 --> 00:01:29,336 it's a much simpler 27 00:01:29,336 --> 00:01:32,555 and a much more easy-to-explain answer. 
28 00:01:33,696 --> 00:01:36,222 The languages go extinct 29 00:01:36,220 --> 00:01:37,888 because they are not passed down 30 00:01:37,888 --> 00:01:39,733 from one generation to the next. 31 00:01:42,280 --> 00:01:43,866 Every single time a person who speaks 32 00:01:43,866 --> 00:01:46,088 a minority language has a child, 33 00:01:46,752 --> 00:01:50,355 they go through a calculus. 34 00:01:51,360 --> 00:01:52,800 They ask themselves, 35 00:01:53,660 --> 00:01:56,288 "Do I pass my language down to my child, 36 00:01:56,770 --> 00:02:01,311 or do I instead teach them only the majority language?" 37 00:02:01,311 --> 00:02:03,222 Essentially, there is a scale that goes on 38 00:02:03,900 --> 00:02:05,844 that they access in their heads, 39 00:02:06,720 --> 00:02:08,355 in which on one side 40 00:02:09,530 --> 00:02:11,733 every single time in their lives 41 00:02:11,737 --> 00:02:14,222 that they've had an opportunity to use their native language 42 00:02:14,866 --> 00:02:18,490 for communication, for access to traditional culture, 43 00:02:19,776 --> 00:02:21,748 a stone is placed on the left side. 44 00:02:22,228 --> 00:02:23,840 And every time that they find themselves 45 00:02:23,840 --> 00:02:25,755 unable to use their native language, 46 00:02:25,770 --> 00:02:27,955 and instead have to rely on the majority language, 47 00:02:27,958 --> 00:02:30,066 a stone is placed on the right side. 48 00:02:31,822 --> 00:02:34,800 Now, due to the strength and the dignity 49 00:02:34,800 --> 00:02:36,600 of being able to speak one's mother tongue, 50 00:02:36,600 --> 00:02:38,720 the stones on the left tend to be a bit heavier. 
51 00:02:38,720 --> 00:02:42,048 But with enough stones on the right side, 52 00:02:42,560 --> 00:02:44,600 then eventually the scale tips, 53 00:02:44,600 --> 00:02:47,111 and then when a person makes the decision 54 00:02:47,111 --> 00:02:49,150 to pass their language down, 55 00:02:49,160 --> 00:02:50,622 they see their own language 56 00:02:50,622 --> 00:02:52,620 as more of a burden than a blessing. 57 00:02:55,200 --> 00:02:58,676 So the question is, how do we reverse this? 58 00:02:59,450 --> 00:03:01,777 First, we need to think about the fact that, 59 00:03:03,511 --> 00:03:04,968 for any given language, 60 00:03:04,970 --> 00:03:07,900 there are certain social spheres that they can be used in. 61 00:03:07,900 --> 00:03:08,976 So any language 62 00:03:08,976 --> 00:03:10,800 that's a mother tongue spoken today, 63 00:03:10,800 --> 00:03:12,990 can be used with one's family. 64 00:03:13,790 --> 00:03:16,671 A smaller set of languages can be used within one's community, 65 00:03:16,671 --> 00:03:18,660 a smaller set, maybe within one's region, 66 00:03:19,288 --> 00:03:22,155 and for a small handful of languages, 67 00:03:22,511 --> 00:03:24,488 they can be used for international communication. 68 00:03:25,824 --> 00:03:28,640 And then even across these spheres, 69 00:03:28,640 --> 00:03:31,712 there's the question of can someone use their language, 70 00:03:31,712 --> 00:03:35,533 for the purpose of education or business, 71 00:03:35,911 --> 00:03:37,600 or in technology? 72 00:03:39,136 --> 00:03:41,952 So, to better explain 73 00:03:43,200 --> 00:03:44,530 what I'm talking about here, 74 00:03:44,530 --> 00:03:46,393 I would like to use an anecdote. 75 00:03:48,400 --> 00:03:50,400 Let's say that you are about to go 76 00:03:50,400 --> 00:03:52,280 on your dream vacation to India, 77 00:03:53,155 --> 00:03:56,032 and you have an eight-hour layover in Istanbul. 
78 00:03:57,312 --> 00:04:00,640 Now, you weren't necessarily planning on visiting Turkey, 79 00:04:00,896 --> 00:04:04,266 but with your layover and with a Turkish friend 80 00:04:04,266 --> 00:04:05,933 telling you about an amazing restaurant 81 00:04:05,933 --> 00:04:07,400 that's not too far from the airport, 82 00:04:07,800 --> 00:04:10,600 you say, "Hey, you know, maybe I'll stop by during my layover." 83 00:04:11,022 --> 00:04:12,920 So, you exit the airport, 84 00:04:13,950 --> 00:04:15,480 you get to your restaurant, 85 00:04:15,480 --> 00:04:17,020 and they hand you a menu, 86 00:04:17,020 --> 00:04:19,086 and the menu is entirely in Turkish. 87 00:04:20,170 --> 00:04:22,911 Now, let's say, for the point of this exercise, 88 00:04:22,911 --> 00:04:24,377 that you don't speak Turkish. 89 00:04:25,210 --> 00:04:26,535 What do you do? 90 00:04:28,155 --> 00:04:29,744 Well, best-case scenario, 91 00:04:29,744 --> 00:04:32,177 you find someone perhaps who can speak your native language, 92 00:04:32,383 --> 00:04:34,264 German, English, etc. 93 00:04:36,220 --> 00:04:37,997 But let's say it's not your lucky day 94 00:04:38,000 --> 00:04:41,066 and nobody in the restaurant can speak any German or any English. 95 00:04:42,000 --> 00:04:43,377 So what do you do? 96 00:04:43,377 --> 00:04:45,995 Well, if you are like me, and I imagine most of you, 97 00:04:45,995 --> 00:04:48,130 you've probably turned to a technological solution, 98 00:04:49,535 --> 00:04:52,351 machine translation or a digital dictionary, 99 00:04:52,607 --> 00:04:54,196 look up each word individually, 100 00:04:54,399 --> 00:04:57,733 and eventually order yourself a delicious Turkish meal. 101 00:04:59,970 --> 00:05:02,844 Now, let's imagine this scenario instead, 102 00:05:03,610 --> 00:05:06,400 in which you are the native speaker of a minority language. 103 00:05:07,455 --> 00:05:09,333 Let's say, Lower Sorbian. 
104 00:05:09,333 --> 00:05:11,000 Lower Sorbian is an endangered language 105 00:05:11,000 --> 00:05:12,488 spoken here in Germany, 106 00:05:12,488 --> 00:05:16,888 about 130 kilometers to the southeast from here, 107 00:05:17,711 --> 00:05:20,857 that's spoken only by a few thousand people, mostly elderly. 108 00:05:22,810 --> 00:05:25,111 Now, let's say your mother tongue is Lower Sorbian. 109 00:05:25,370 --> 00:05:26,773 You end up in the restaurant. 110 00:05:26,773 --> 00:05:28,462 Now, of course, the odds of finding someone 111 00:05:28,462 --> 00:05:31,387 who speaks your native language in the restaurant is extraordinarily low. 112 00:05:32,280 --> 00:05:36,412 But, again, you can just go to a technological solution. 113 00:05:36,890 --> 00:05:39,333 However, for your native language, 114 00:05:39,333 --> 00:05:41,718 these technological solutions don't exist. 115 00:05:42,010 --> 00:05:44,991 You would have to rely on German or English 116 00:05:44,991 --> 00:05:47,488 as your pivot language into Turkish. 117 00:05:48,920 --> 00:05:52,382 Now, of course, you still end up getting your delicious Turkish meal, 118 00:05:52,382 --> 00:05:54,860 but you begin to think about how difficult this would have been 119 00:05:54,860 --> 00:05:57,170 if you were your grandfather, who spoke no German at all. 120 00:05:58,244 --> 00:05:59,840 Now, this is just a small incident, 121 00:05:59,844 --> 00:06:04,787 but it's going to place a stone on the right side of that scale, 122 00:06:05,310 --> 00:06:07,053 and make you think perhaps 123 00:06:07,053 --> 00:06:09,898 maybe when I have children or maybe when I have another child, 124 00:06:10,943 --> 00:06:14,726 the burden that you went through with this 125 00:06:14,726 --> 00:06:17,133 may not be worth it to keep your language. 
126 00:06:19,391 --> 00:06:21,284 And imagine if this was a scenario 127 00:06:21,284 --> 00:06:26,177 that was of significantly more importance, 128 00:06:26,177 --> 00:06:28,380 such as, for example, being in a hospital. 129 00:06:31,133 --> 00:06:36,161 Now, this is the point in which we can help-- 130 00:06:36,790 --> 00:06:40,242 by we, I mean me and you in this room can help. 131 00:06:41,400 --> 00:06:43,355 We have the tools to be able to help this. 132 00:06:45,155 --> 00:06:47,355 If technological tools are available for people 133 00:06:47,355 --> 00:06:49,350 who speak minority and underserved languages, 134 00:06:50,555 --> 00:06:54,022 it puts a little finger on the scale, on the left side of the scale. 135 00:06:54,022 --> 00:06:55,776 Someone doesn't necessarily have to think 136 00:06:55,776 --> 00:06:57,680 that they have to rely on a minority language 137 00:06:57,680 --> 00:06:59,488 in order to interact with the outside world, 138 00:07:00,351 --> 00:07:05,111 because it opens the social spheres 139 00:07:05,111 --> 00:07:06,328 a little bit more. 140 00:07:07,910 --> 00:07:10,333 So, of course, the ideal solution 141 00:07:10,333 --> 00:07:13,022 is that we have machine translation in every language in the world. 142 00:07:13,022 --> 00:07:16,831 But, unfortunately, that's just not feasible. 143 00:07:16,831 --> 00:07:19,800 Machine translation requires large corpuses of text, 144 00:07:19,800 --> 00:07:21,088 and for many of these languages 145 00:07:21,088 --> 00:07:23,080 that are endangered or underserved, 146 00:07:23,391 --> 00:07:25,439 such data is simply not available. 147 00:07:26,309 --> 00:07:28,279 Some of them aren't even commonly written 148 00:07:29,000 --> 00:07:32,825 and thus getting enough data to make a machine translation engine 149 00:07:32,825 --> 00:07:34,390 is unlikely. 150 00:07:34,390 --> 00:07:38,060 But what is available is lexical data. 
151 00:07:40,244 --> 00:07:43,444 Through the work of many linguists 152 00:07:43,444 --> 00:07:45,440 over the past few hundred years, 153 00:07:47,777 --> 00:07:49,728 dictionaries and grammars have been produced 154 00:07:49,728 --> 00:07:51,680 for most of the world's languages. 155 00:07:53,920 --> 00:07:56,511 But, unfortunately, most of these works 156 00:07:56,511 --> 00:08:00,644 are not accessible or available to the world, 157 00:08:00,647 --> 00:08:03,533 let alone to speakers of these minority languages. 158 00:08:04,522 --> 00:08:06,377 And it's not an intentional process, 159 00:08:06,377 --> 00:08:07,910 a lot of times it's simply because 160 00:08:07,910 --> 00:08:10,785 the initial print run of these dictionaries was small, 161 00:08:11,155 --> 00:08:12,543 and the only copies 162 00:08:12,543 --> 00:08:16,244 are moldering away in a university library somewhere. 163 00:08:17,511 --> 00:08:21,333 But we have the ability to take that data 164 00:08:21,333 --> 00:08:23,330 and make it accessible to the world. 165 00:08:24,133 --> 00:08:28,377 The Wikimedia Foundation is one of the best organizations, 166 00:08:28,377 --> 00:08:30,555 I would say *the* best organization in the world, 167 00:08:30,975 --> 00:08:33,396 for getting data available 168 00:08:33,396 --> 00:08:36,688 to the vast majority of the population of this planet. 169 00:08:38,533 --> 00:08:40,134 So let's work on that. 170 00:08:41,000 --> 00:08:43,222 So to explain a little bit 171 00:08:43,224 --> 00:08:45,050 about what we've been doing in this regard, 172 00:08:45,311 --> 00:08:48,127 I'd like to introduce my organization, PanLex, 173 00:08:48,711 --> 00:08:51,888 which is an organization that is attempting 174 00:08:51,888 --> 00:08:54,146 to collect lexical data for this purpose. 175 00:08:54,780 --> 00:08:56,830 We got started about 12 years ago 176 00:08:56,830 --> 00:08:59,600 at the University of Washington, as a research project. 
177 00:08:59,600 --> 00:09:01,088 The idea behind it 178 00:09:01,088 --> 00:09:03,990 was to show that inferred translations 179 00:09:04,377 --> 00:09:07,125 could create an effective translation device, 180 00:09:07,125 --> 00:09:09,088 essentially a lexical translation device. 181 00:09:09,088 --> 00:09:12,223 This is an example from PanLex data itself. 182 00:09:12,680 --> 00:09:14,057 This is showing how to translate 183 00:09:14,066 --> 00:09:17,805 the word "ev" in Turkish, which means house, 184 00:09:17,805 --> 00:09:19,555 to Lower Sorbian, 185 00:09:19,555 --> 00:09:21,201 the language I was referring to earlier. 186 00:09:21,212 --> 00:09:23,190 So it's unlikely to find 187 00:09:24,333 --> 00:09:26,200 Turkish to Lower Sorbian dictionaries, 188 00:09:26,200 --> 00:09:28,244 but by passing it through 189 00:09:28,244 --> 00:09:30,240 many, many different intermediate languages, 190 00:09:30,488 --> 00:09:32,600 you can create effective translations. 191 00:09:34,333 --> 00:09:36,911 So, once this was shown in the research projects, 192 00:09:36,911 --> 00:09:39,631 the founder of PanLex, Dr. Jonathan Pool, 193 00:09:40,711 --> 00:09:43,666 decided, "Well, you know, why not actually just do this?" 194 00:09:43,666 --> 00:09:45,470 So he started a non-profit 195 00:09:45,470 --> 00:09:48,522 to collect as much lexical data as possible and make it accessible. 196 00:09:48,911 --> 00:09:51,066 That's what we've been doing for the past 12 years. 197 00:09:51,066 --> 00:09:54,516 In that time, we've collected thousands and thousands of dictionaries, 198 00:09:54,516 --> 00:09:56,479 and extracted lexical data out of them 199 00:09:56,479 --> 00:10:01,340 and compiled a database that allows inferred lexical translation 200 00:10:01,340 --> 00:10:03,755 across any of-- 201 00:10:03,755 --> 00:10:05,866 Our current count is around 5,500 202 00:10:05,860 --> 00:10:07,955 of the 7,500 languages in the world. 
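The idea of inferring a translation by passing a word through intermediate languages can be sketched in a few lines of Python. This is only an illustration of the general technique, not PanLex's actual method or data model: the entries, the function names (`build_graph`, `infer_translations`), and the scoring rule (count how many intermediate words support each candidate) are all assumptions made for the example, and the word pairs are toy data rather than PanLex records.

```python
from collections import defaultdict

# Toy bilingual dictionary entries as ((language, word), (language, word)).
# Illustrative only; these pairs are not taken from PanLex.
ENTRIES = [
    (("tur", "ev"), ("eng", "house")),
    (("tur", "ev"), ("deu", "Haus")),
    (("tur", "ev"), ("eng", "home")),
    (("eng", "house"), ("dsb", "wjaža")),
    (("deu", "Haus"), ("dsb", "wjaža")),
    (("eng", "home"), ("dsb", "dom")),
]


def build_graph(entries):
    """Treat every dictionary entry as an undirected edge between two words."""
    graph = defaultdict(set)
    for left, right in entries:
        graph[left].add(right)
        graph[right].add(left)
    return graph


def infer_translations(graph, source, target_lang):
    """Rank target-language words reachable through one intermediate word.

    A candidate's score is the number of distinct intermediate words
    that link it to the source word (a hypothetical scoring rule).
    """
    scores = defaultdict(int)
    for pivot in graph.get(source, ()):
        if pivot[0] == target_lang:
            continue  # a direct entry is not an inferred translation
        for candidate in graph.get(pivot, ()):
            if candidate[0] == target_lang:
                scores[candidate[1]] += 1
    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    graph = build_graph(ENTRIES)
    for word, support in infer_translations(graph, ("tur", "ev"), "dsb"):
        print(word, support)
```

Counting supporting paths is a deliberately simple choice: a candidate reached through both English and German is ranked above one reached through a single intermediate word, which mirrors the point that many intermediate languages make the inferred translation more reliable.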
203 00:10:08,511 --> 00:10:10,685 And, of course, 204 00:10:10,685 --> 00:10:12,221 we're constantly trying to expand that 205 00:10:12,221 --> 00:10:14,784 and expand the data on each individual language. 206 00:10:17,220 --> 00:10:21,111 So, the next question is, 207 00:10:22,079 --> 00:10:25,663 what can we do to work together on this? 208 00:10:26,680 --> 00:10:28,931 We, at PanLex, have been extremely excited to watch 209 00:10:28,931 --> 00:10:31,260 the development on lexical data, 210 00:10:31,260 --> 00:10:34,175 that Wikidata has been working on lately. 211 00:10:35,155 --> 00:10:37,548 It's very fascinating to see organizations 212 00:10:37,550 --> 00:10:39,476 that are working in a very similar sphere, 213 00:10:39,476 --> 00:10:41,183 but in different aspects. 214 00:10:41,535 --> 00:10:44,351 And we are extremely excited to see 215 00:10:44,733 --> 00:10:46,466 the results of this from Wikidata. 216 00:10:46,466 --> 00:10:51,144 And also we are looking forward to collaborating with Wikidata. 217 00:10:53,844 --> 00:10:56,271 I think that the special skills 218 00:10:56,271 --> 00:10:58,022 that we've developed over the past 12 years, 219 00:10:58,022 --> 00:11:01,555 with not just collecting lexical data, but also in database design, 220 00:11:01,557 --> 00:11:03,908 could be extremely useful for Wikidata. 221 00:11:03,910 --> 00:11:07,111 And on the other side, I think that-- 222 00:11:08,415 --> 00:11:10,975 I especially am excited about Wikidata's 223 00:11:11,743 --> 00:11:14,549 ability to do crowdsourcing of data. 224 00:11:15,129 --> 00:11:18,047 PanLex, currently, our sources are entirely 225 00:11:18,399 --> 00:11:20,959 printed lexical sources or other types of lexical sources, 226 00:11:21,170 --> 00:11:22,662 but we don't do any crowdsourcing. 
227 00:11:22,670 --> 00:11:24,920 We simply don't have the infrastructure for it available 228 00:11:24,920 --> 00:11:26,931 and of course, the Wikimedia Foundation 229 00:11:26,933 --> 00:11:28,930 is the world expert in crowdsourcing. 230 00:11:31,848 --> 00:11:33,728 I'm really looking forward to seeing exactly 231 00:11:33,733 --> 00:11:35,680 how we can apply these skills together. 232 00:11:38,533 --> 00:11:41,600 But, overall, I think the main thing to think about this 233 00:11:41,600 --> 00:11:43,457 is that when we were working on these things, 234 00:11:43,461 --> 00:11:45,133 it's minute detail. 235 00:11:45,133 --> 00:11:47,533 We're sitting around looking at grammatical forms, 236 00:11:47,533 --> 00:11:51,911 or paging our way through dictionaries, ancient dictionaries, 237 00:11:51,915 --> 00:11:53,977 or sometimes recently published dictionaries 238 00:11:53,977 --> 00:11:57,466 and getting into written forms of words, 239 00:11:57,466 --> 00:11:59,994 and it feels very close up. 240 00:11:59,994 --> 00:12:01,535 But, occasionally, we need to remember 241 00:12:01,535 --> 00:12:02,556 to take a step back 242 00:12:02,556 --> 00:12:04,951 in that, even though what we're doing 243 00:12:06,231 --> 00:12:08,831 can feel even mundane at times, 244 00:12:10,091 --> 00:12:11,957 the work we're doing is extremely important. 245 00:12:13,010 --> 00:12:15,666 This is, in my opinion, the absolute best way 246 00:12:15,666 --> 00:12:18,862 that we can support endangered languages 247 00:12:18,862 --> 00:12:21,488 and make sure that the linguistic diversity of the planet 248 00:12:21,488 --> 00:12:25,730 is preserved up to the end of this century or longer. 249 00:12:26,444 --> 00:12:29,644 It's entirely possible that the work that we're doing today 250 00:12:29,644 --> 00:12:32,577 may result in languages 251 00:12:32,577 --> 00:12:35,355 being preserved and passed down, 252 00:12:35,355 --> 00:12:36,955 and not going extinct. 
253 00:12:38,527 --> 00:12:40,605 So just to remember 254 00:12:40,605 --> 00:12:43,207 that even if you're sitting around on your computer 255 00:12:43,207 --> 00:12:44,480 editing an individual entry 256 00:12:44,480 --> 00:12:49,707 and adding the data form of a small minority language 257 00:12:49,707 --> 00:12:51,796 for every single noun, 258 00:12:51,800 --> 00:12:54,577 the little thing that you're doing right now, 259 00:12:54,577 --> 00:12:57,528 might actually be partially responsible 260 00:12:57,533 --> 00:12:59,155 for making sure that language survives, 261 00:12:59,155 --> 00:13:01,060 until the end of the century or longer. 262 00:13:02,591 --> 00:13:03,703 Thank you very much, 263 00:13:03,703 --> 00:13:05,717 and I'd like to open the floor to questions. 264 00:13:06,222 --> 00:13:08,373 (applause) 265 00:13:23,688 --> 00:13:24,977 (woman 1) Thank you. 266 00:13:24,977 --> 00:13:26,701 - Thank you for your talk. - Thank you. 267 00:13:26,701 --> 00:13:28,777 (woman 1) I just have a question about dictionaries. 268 00:13:28,777 --> 00:13:31,107 You said that you work with printed dictionaries? 269 00:13:31,107 --> 00:13:32,312 - Yes. - (woman 1) So my question 270 00:13:32,312 --> 00:13:34,508 is what do you take from those dictionaries 271 00:13:34,511 --> 00:13:38,222 and if there's any copyright thing you have to deal with? 272 00:13:38,222 --> 00:13:41,060 I anticipated this to be the first question that I would get. 273 00:13:41,060 --> 00:13:42,827 (laughter) 274 00:13:42,827 --> 00:13:46,358 So, first off, for PanLex, 275 00:13:46,358 --> 00:13:50,244 we have, according to our legal resources that we have consulted, 276 00:13:52,734 --> 00:13:57,466 whereas the arrangement and organization of a dictionary is copyrightable, 277 00:13:57,466 --> 00:14:03,260 the translation itself is not considered copyrightable. 
278 00:14:04,170 --> 00:14:05,808 A good example is like, for example, 279 00:14:05,808 --> 00:14:10,525 a phone book is considered, at least according to US law, 280 00:14:10,956 --> 00:14:11,965 copyrightable. 281 00:14:11,965 --> 00:14:16,800 But saying that person X's phone number is digits D 282 00:14:16,800 --> 00:14:18,360 is not copyrightable. 283 00:14:21,666 --> 00:14:23,444 So like I said, 284 00:14:23,444 --> 00:14:25,311 according to our legal scholars, 285 00:14:25,311 --> 00:14:27,333 this is how we can deal with this. 286 00:14:27,333 --> 00:14:30,666 But even if that's not a solid enough legal argument, 287 00:14:30,666 --> 00:14:32,063 one important thing to remember 288 00:14:32,063 --> 00:14:38,269 is that the vast majority of these lexical data, 289 00:14:39,355 --> 00:14:40,530 is actually out of copyright. 290 00:14:40,530 --> 00:14:42,822 A significant number of these are out of copyright 291 00:14:42,822 --> 00:14:44,333 and thus can be used without [end]. 292 00:14:44,333 --> 00:14:46,783 And the other thing is that oftentimes, for example, 293 00:14:47,311 --> 00:14:49,644 if we're working with a recently made print dictionary, 294 00:14:49,640 --> 00:14:51,577 rather than trying to scan it and OCR it, 295 00:14:51,577 --> 00:14:53,439 we just email the person who made it. 296 00:14:53,439 --> 00:14:57,600 And it turns out that most linguists are really excited 297 00:14:57,600 --> 00:14:59,600 that their data can be made accessible. 298 00:14:59,600 --> 00:15:01,267 And so they're like, "Sure, please, 299 00:15:01,267 --> 00:15:03,273 just put it all in there and make it accessible." 300 00:15:05,533 --> 00:15:08,424 So like I said, we have, at least, according to our legal opinions, 301 00:15:08,424 --> 00:15:09,466 we have the ability, 302 00:15:09,466 --> 00:15:11,177 but even if you don't want to go with that, 303 00:15:11,177 --> 00:15:15,644 it's very easy to get the data publicly accessible. 
304 00:15:26,288 --> 00:15:28,470 - (man 1) Thank you. Hi. - Hi. 305 00:15:28,470 --> 00:15:29,830 (man 1) Can you say a little more 306 00:15:29,830 --> 00:15:35,031 about how the person who speaks Lower Sorbian is accessing the data. 307 00:15:35,031 --> 00:15:38,355 Like specifically how that information is getting to them 308 00:15:38,357 --> 00:15:40,977 and how that might help to convince them 309 00:15:40,977 --> 00:15:42,800 to either try out the-- 310 00:15:42,800 --> 00:15:44,680 Great question and this is actually 311 00:15:44,680 --> 00:15:46,266 one that I think about a lot as well, 312 00:15:46,266 --> 00:15:49,759 because I think that when we talk about data access, 313 00:15:50,270 --> 00:15:53,244 there's actually a multiple step of this, multiple steps. 314 00:15:53,244 --> 00:15:56,288 One is, of course, data preservation, make sure the data doesn't go away. 315 00:15:56,288 --> 00:15:58,911 Secondly, is make sure it's interoperable 316 00:15:59,177 --> 00:16:01,844 and can be used. 317 00:16:01,844 --> 00:16:05,370 And thirdly is make sure that it's available. 318 00:16:05,631 --> 00:16:07,333 So in PanLex's case, 319 00:16:07,333 --> 00:16:09,755 we have an API that can be used, 320 00:16:09,755 --> 00:16:11,888 but, obviously, that can't be used by an end user. 321 00:16:11,888 --> 00:16:14,847 But we've also developed interfaces. 322 00:16:15,155 --> 00:16:19,727 And so, for example, if you go to *translate.panlex.org*, 323 00:16:19,728 --> 00:16:22,711 you can do translations on our database. 324 00:16:22,711 --> 00:16:25,864 If you want to mess around with the API, just go to *dev.panlex.org,* 325 00:16:25,866 --> 00:16:29,222 and you can find a bunch of stuff on the API, or just *api.panlex.org*. 
326 00:16:30,950 --> 00:16:32,542 But there's another step too, 327 00:16:32,542 --> 00:16:36,577 which is that even if you make all of your data completely accessible 328 00:16:36,570 --> 00:16:40,533 with tools that are super useful to be able to access it, 329 00:16:41,210 --> 00:16:43,244 if you don't actually promote the tools, 330 00:16:43,244 --> 00:16:45,058 then people won't actually be able to use it. 331 00:16:45,058 --> 00:16:47,177 And this is honestly kind of a... 332 00:16:48,827 --> 00:16:51,044 the thing that isn't talked about enough, 333 00:16:51,044 --> 00:16:52,955 and I don't have a good answer for it. 334 00:16:52,955 --> 00:16:54,800 How do we make sure that-- 335 00:16:55,022 --> 00:16:56,933 For example, I only fairly recently, 336 00:16:56,933 --> 00:16:59,647 only a few years ago got acquainted with Wikidata, 337 00:16:59,647 --> 00:17:02,463 and it's exactly the kind of thing that I'm interested in. 338 00:17:02,970 --> 00:17:07,177 So, how do we promote ourselves to others? 339 00:17:07,177 --> 00:17:08,780 I'm leaving that as an open question. 340 00:17:08,780 --> 00:17:10,800 Like I said, I don't have a good answer for this. 341 00:17:10,800 --> 00:17:12,888 But, of course, in order to do that, 342 00:17:12,888 --> 00:17:14,880 we still need to accomplish the first few steps. 343 00:17:22,133 --> 00:17:24,777 (man 2) If we want to have machine translation, 344 00:17:24,777 --> 00:17:27,822 don't we need a translation memory? 345 00:17:27,827 --> 00:17:30,666 I'm not sure that the individual words 346 00:17:30,666 --> 00:17:32,918 that we put into Wikidata, 347 00:17:32,918 --> 00:17:36,558 these short phrases that we put into Wikidata, 348 00:17:36,558 --> 00:17:41,130 either as ordinary Wikidata items or as Wikidata lexemes, 349 00:17:41,130 --> 00:17:43,953 are sufficient to do a proper translation. 
350 00:17:43,955 --> 00:17:46,600 We need to have full sentences, for example, for-- 351 00:17:46,772 --> 00:17:48,320 (Benjamin) Yeah, absolutely. 352 00:17:48,577 --> 00:17:51,422 (man 2) And where do we get this data structure? 353 00:17:51,422 --> 00:17:55,177 I'm not sure that, currently, 354 00:17:55,177 --> 00:17:59,533 Wikidata is able to very well handle 355 00:17:59,533 --> 00:18:03,066 the issue of a translation memory, 356 00:18:04,324 --> 00:18:05,965 *translatewiki.net*, 357 00:18:05,965 --> 00:18:09,490 for getting into that gap of... 358 00:18:12,111 --> 00:18:14,993 Should we do anything in that respect, or should we-- 359 00:18:15,000 --> 00:18:17,133 Yeah, and I really appreciate your question. 360 00:18:17,135 --> 00:18:18,715 I touched on this a little bit earlier, 361 00:18:18,715 --> 00:18:20,361 but I'd love to reiterate it. 362 00:18:21,356 --> 00:18:24,955 This is precisely the reason that PanLex works in lexical data 363 00:18:24,955 --> 00:18:27,030 and why I'm excited about lexical data, 364 00:18:27,030 --> 00:18:29,935 as opposed to-- not as opposed to, but in addition 365 00:18:29,935 --> 00:18:35,207 to machine translation engines and machine translation in general. 366 00:18:35,900 --> 00:18:39,200 As you said, machine translation requires a specific kind of data, 367 00:18:39,740 --> 00:18:43,123 and that data is not available for most of the world's languages. 368 00:18:43,123 --> 00:18:44,966 For the vast majority of the world's languages, 369 00:18:44,966 --> 00:18:46,379 that simply is not available. 370 00:18:46,650 --> 00:18:48,447 But that doesn't mean we should just give up. 371 00:18:48,447 --> 00:18:49,627 Like why? 372 00:18:51,260 --> 00:18:54,444 If I needed to translate my Turkish restaurant menu, 373 00:18:54,755 --> 00:18:59,360 then lexical translation will likely be an exceptionally good tool for that. 
374 00:18:59,360 --> 00:19:01,715 Now, I'm not saying that you can use lexical translation 375 00:19:01,715 --> 00:19:04,600 to do perfect paragraph to paragraph translation. 376 00:19:04,600 --> 00:19:06,866 When I say lexical translation, I mean word to word 377 00:19:06,866 --> 00:19:09,670 and word to word translation can be extremely useful. 378 00:19:12,231 --> 00:19:14,708 It's funny to think about it, but we didn't really have access 379 00:19:14,708 --> 00:19:16,620 to really good machine translation. 380 00:19:16,620 --> 00:19:20,191 Everyone didn't have access to that until fairly recently. 381 00:19:20,191 --> 00:19:23,649 And we still got by with dictionaries, 382 00:19:23,649 --> 00:19:27,687 and they're an incredibly good resource. 383 00:19:28,311 --> 00:19:31,288 And the data is available, so why not make it available 384 00:19:31,288 --> 00:19:34,377 to the world at large and to the speakers of these languages? 385 00:19:36,422 --> 00:19:38,666 (woman 2) Hi, what mechanisms do you have in place 386 00:19:38,666 --> 00:19:40,666 when the community itself--I'm over here. 387 00:19:40,666 --> 00:19:43,253 - Where are you? Okay, right. - (woman 2) Yeah, sorry. (laughs) 388 00:19:43,253 --> 00:19:44,577 ...when the community itself 389 00:19:44,577 --> 00:19:47,320 doesn't want part of their data in PanLex? 390 00:19:47,320 --> 00:19:48,933 Great question. 391 00:19:48,933 --> 00:19:51,955 So the way that we work with that 392 00:19:51,955 --> 00:19:56,287 is that if a dictionary is published and made publicly available, 393 00:19:56,666 --> 00:19:58,133 that's a good indication. 394 00:19:58,133 --> 00:20:02,400 Like you could buy it in a store or at a university library, 395 00:20:02,400 --> 00:20:04,690 or a public library anyone can access. 396 00:20:04,690 --> 00:20:08,080 That's a good indication that that decision has been made. 
397 00:20:08,080 --> 00:20:11,577 (woman 2) [inaudible] 398 00:20:15,740 --> 00:20:18,266 (man 3) Please, [inaudible], could you speak in the microphone? 399 00:20:19,295 --> 00:20:20,447 Can you say it again? 400 00:20:20,447 --> 00:20:23,307 (woman 2) Linguists don't always have the permission of the community. 401 00:20:23,307 --> 00:20:24,387 In order to publish things, 402 00:20:24,387 --> 00:20:27,533 they oftentimes publish things without the consent of the community. 403 00:20:27,533 --> 00:20:29,577 And that's absolutely true. 404 00:20:29,577 --> 00:20:32,533 I would say that is a-- 405 00:20:32,533 --> 00:20:34,422 That does happen. 406 00:20:34,422 --> 00:20:36,770 I would say it's generally a small minority of cases, 407 00:20:36,770 --> 00:20:40,955 mostly confined to generally North America, 408 00:20:40,955 --> 00:20:43,355 although sometimes South American languages as well. 409 00:20:44,765 --> 00:20:46,488 It's something we have to take into account. 410 00:20:46,488 --> 00:20:49,288 If we were to receive word, for example, 411 00:20:49,288 --> 00:20:52,377 that the data that is in PanLex 412 00:20:52,377 --> 00:20:56,330 should not be accessed by the greater world, 413 00:20:56,330 --> 00:20:58,040 then, of course, we would remove it. 414 00:20:58,040 --> 00:20:59,310 (woman 2) Good, good. 415 00:21:01,281 --> 00:21:02,451 That doesn't mean, of course, 416 00:21:02,451 --> 00:21:04,391 that we'll listen to copyright rules necessarily 417 00:21:04,391 --> 00:21:06,542 but we will listen to traditional communities, 418 00:21:06,542 --> 00:21:08,157 and that's the major difference. 419 00:21:08,157 --> 00:21:10,252 (woman 2) Yeah, that's what I'm referring to. 420 00:21:15,022 --> 00:21:16,755 It brings up a really interesting point, 421 00:21:16,755 --> 00:21:18,350 which is that 422 00:21:18,844 --> 00:21:22,244 sometimes it's a really big question of who speaks for a language. 
423 00:21:23,000 --> 00:21:27,911 I had some experience actually visiting the American Southwest 424 00:21:27,911 --> 00:21:29,755 and working with some groups, 425 00:21:29,777 --> 00:21:32,288 who work on indigenous, the Pueblo languages out there. 426 00:21:36,053 --> 00:21:38,044 So there is approximately 427 00:21:38,044 --> 00:21:40,220 six Pueblo languages, depending on how you slice it, 428 00:21:40,220 --> 00:21:41,955 spoken in that area. 429 00:21:41,955 --> 00:21:44,022 But they are divided amongst 18 different Pueblos 430 00:21:44,320 --> 00:21:47,066 and each one has their own tribal government, 431 00:21:47,066 --> 00:21:50,022 and each government may have a different opinion 432 00:21:50,022 --> 00:21:54,007 on whether their language should be accessible to outsiders or not. 433 00:21:56,626 --> 00:21:58,170 Like, for example, Zuni Pueblo, 434 00:21:58,170 --> 00:22:01,472 it's a single Pueblo that speaks Zuni language. 435 00:22:02,923 --> 00:22:05,274 And they're really big on their language going everywhere, 436 00:22:05,274 --> 00:22:07,694 they put it on the street signs and everything, it's great. 437 00:22:07,694 --> 00:22:10,637 But for some of the other languages, 438 00:22:10,644 --> 00:22:13,051 you might have one group that says, 439 00:22:13,051 --> 00:22:15,866 "Yeah, we don't want our language being accessed by outsiders." 440 00:22:15,871 --> 00:22:18,838 But then you have the neighboring Pueblo who speaks the same language say, 441 00:22:18,838 --> 00:22:21,666 "We really want our language accessible to outsiders 442 00:22:21,666 --> 00:22:24,088 in using these technological tools, 443 00:22:24,088 --> 00:22:26,560 because we want our language to be able to continue on." 444 00:22:26,560 --> 00:22:29,488 And it raises a really interesting ethical question. 
445 00:22:29,488 --> 00:22:31,651 Because if you default by saying, 446 00:22:31,651 --> 00:22:34,622 "Fine, I'm cutting it off because this group said we should cut it off"-- 447 00:22:34,622 --> 00:22:36,711 aren't you also disservicing the second group 448 00:22:36,711 --> 00:22:39,360 because they actively want you to roll out these things? 449 00:22:39,360 --> 00:22:42,755 So I don't think this is a question that has an easy answer. 450 00:22:42,755 --> 00:22:44,955 But I would say at least in terms of PanLex. 451 00:22:44,955 --> 00:22:48,938 And for the record, we actually haven't encountered this yet, 452 00:22:48,938 --> 00:22:50,407 that I'm aware of. 453 00:22:50,933 --> 00:22:52,920 Now, that could be partially because... 454 00:22:53,666 --> 00:22:55,444 Getting back to his question, 455 00:22:55,666 --> 00:22:57,790 we may need to promote more. (chuckles) 456 00:22:58,660 --> 00:23:02,155 But, in general, as far as I know, 457 00:23:02,155 --> 00:23:04,488 we have not had this come up. 458 00:23:04,488 --> 00:23:06,871 But our game plan for this 459 00:23:06,871 --> 00:23:10,975 is if a community says they don't want their data in a database, 460 00:23:10,975 --> 00:23:12,095 then we remove it. 461 00:23:12,095 --> 00:23:14,916 (woman 2) Because we have come up with it in Wikidata and Wikipedia... 462 00:23:14,916 --> 00:23:16,140 - You have? - (woman 2) ...in comments. 463 00:23:16,140 --> 00:23:17,407 - Really? - (woman 2) It's been a problem. 464 00:23:17,407 --> 00:23:20,488 Yeah, I can imagine especially in comments for photos or certain things. 465 00:23:20,488 --> 00:23:21,900 (woman 2) Correct. 466 00:23:27,177 --> 00:23:33,170 (man 4) Hi, I had a question about the crowdsourcing aspect of this.
467 00:23:34,087 --> 00:23:36,644 As far as going in and asking a community 468 00:23:36,654 --> 00:23:40,480 to annotate or add data for a dataset, 469 00:23:40,480 --> 00:23:44,200 one of the things that's a little intimidating is like, 470 00:23:44,711 --> 00:23:49,244 as an editor, I can only see what things are missing. 471 00:23:49,244 --> 00:23:53,242 But if I'm going to spend time on things, having an idea, 472 00:23:53,582 --> 00:23:56,672 there's a list of high priority items, 473 00:23:57,755 --> 00:24:01,198 that's, I guess, very motivating in this aspect. 474 00:24:01,200 --> 00:24:04,222 And I was curious if you had a system 475 00:24:04,222 --> 00:24:07,866 which is, essentially, like, we know the gaps in our own data, 476 00:24:07,866 --> 00:24:12,088 we have linguistic evidence to know that these are the ones 477 00:24:12,088 --> 00:24:15,530 that if we had annotated, these would be the high impact drivers. 478 00:24:15,530 --> 00:24:17,152 So I can imagine 479 00:24:18,202 --> 00:24:21,405 having the lexeme for "house" very impactful, 480 00:24:21,405 --> 00:24:24,977 maybe not a lexeme for a data or some other like. 481 00:24:24,977 --> 00:24:28,947 But I was curious if you had that, if it is something 482 00:24:30,217 --> 00:24:35,480 that could be used to drive these community efforts. 483 00:24:35,840 --> 00:24:37,066 Great question. 484 00:24:37,200 --> 00:24:41,216 So one thing that Wikidata has a whole lot of-- 485 00:24:41,216 --> 00:24:44,666 sorry, excuse me, PanLex has a whole lot of are Swadesh lists. 486 00:24:44,666 --> 00:24:47,511 We have apparently the largest collection of Swadesh lists in the world 487 00:24:47,511 --> 00:24:48,555 which is interesting. 488 00:24:48,555 --> 00:24:50,212 If you don't know what a Swadesh list is, 489 00:24:50,212 --> 00:24:56,244 it's essentially a regularized list of lexical items 490 00:24:56,244 --> 00:25:00,040 that can be used for analysis of languages.
491 00:25:00,040 --> 00:25:02,730 They contain really basic sets. 492 00:25:02,730 --> 00:25:05,003 So there's a couple of different kinds of Swadesh lists. 493 00:25:05,003 --> 00:25:07,328 But there are 100 or 213 items 494 00:25:07,328 --> 00:25:08,911 and they might contain 495 00:25:08,911 --> 00:25:12,777 words like "house" and "eye" and "skin" 496 00:25:12,777 --> 00:25:14,444 and basically general words 497 00:25:14,444 --> 00:25:16,331 that you should be able to find in any language. 498 00:25:16,331 --> 00:25:19,888 So that's like a really good starting point 499 00:25:19,888 --> 00:25:22,988 for having that kind of data available. 500 00:25:29,090 --> 00:25:31,126 Now, as I mentioned before, 501 00:25:31,133 --> 00:25:33,600 crowdsourcing is something that we don't do yet 502 00:25:33,600 --> 00:25:36,066 and we're actually really excited to be able to do. 503 00:25:36,066 --> 00:25:37,554 It's one of the things I'm really excited 504 00:25:37,554 --> 00:25:38,993 to talk to people at this conference about, 505 00:25:38,993 --> 00:25:42,982 is how crowdsourcing can be used 506 00:25:42,982 --> 00:25:45,931 and the logistics behind it, 507 00:25:46,200 --> 00:25:48,867 and these are the kind of questions that can come up. 508 00:25:51,288 --> 00:25:53,400 So I guess the answer I can say to you 509 00:25:53,400 --> 00:25:55,376 is that we do have a priority list-- 510 00:25:55,376 --> 00:25:57,684 Actually, one thing I can say is we definitely do have a priority list 511 00:25:57,684 --> 00:25:59,730 when it comes to which languages we are seeking out. 512 00:25:59,730 --> 00:26:02,222 So the way we do this is that we look for languages 513 00:26:02,222 --> 00:26:04,666 that are not currently served by technological solutions, 514 00:26:04,666 --> 00:26:06,977 which are oftentimes minority languages, 515 00:26:06,977 --> 00:26:09,280 or usually minority languages, 516 00:26:09,280 --> 00:26:12,096 and then prioritize those. 
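[Editor's sketch, not part of the talk] The question above asks for a list of high-impact gaps, and the answer points to Swadesh lists as a starting point. One way this could look, purely as an assumption, is a gap report that checks each language against a basic word list. The concept list reuses the three example words from the talk; the language codes, the coverage data, and the function name `swadesh_gaps` are all hypothetical and do not come from PanLex.

```python
# Example concepts taken from the words mentioned in the talk.
# A real Swadesh list has far more items than this.
BASIC_CONCEPTS = ["house", "eye", "skin"]

# Hypothetical coverage data: language code -> concepts that already have an entry.
COVERAGE = {
    "xxa": {"house", "eye", "skin"},
    "xxb": {"eye"},
    "xxc": {"house", "skin"},
}


def swadesh_gaps(coverage, concepts):
    """Return (language, missing concepts) pairs, biggest gaps first."""
    gaps = {}
    for lang, known in coverage.items():
        # Keep the order of the concept list so the report is stable.
        missing = [c for c in concepts if c not in known]
        if missing:
            gaps[lang] = missing
    # Sort by number of missing items (descending), then by language code.
    return sorted(gaps.items(), key=lambda kv: (-len(kv[1]), kv[0]))


if __name__ == "__main__":
    for lang, missing in swadesh_gaps(COVERAGE, BASIC_CONCEPTS):
        print(lang, "is missing:", ", ".join(missing))
```

A report like this would give a volunteer editor a short, ordered to-do list instead of an open-ended view of everything that is missing.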
517 00:26:13,916 --> 00:26:16,844 But in terms of individual lexical items 518 00:26:16,851 --> 00:26:20,244 being the general way we get new data 519 00:26:20,244 --> 00:26:22,977 is essentially by ingesting an entire dictionary's worth. 520 00:26:22,977 --> 00:26:25,911 We are relying on the dictionary's choice 521 00:26:25,911 --> 00:26:29,333 of lexical items, rather than necessarily saying, 522 00:26:29,333 --> 00:26:31,500 we're really looking for the word for "house" in every language. 523 00:26:31,500 --> 00:26:35,000 But when it comes to data crowdsourcing, we will need something like that. 524 00:26:35,000 --> 00:26:37,912 So this is an opportunity for research and growth. 525 00:26:40,044 --> 00:26:43,088 (man 5) Hi, I'm Victor, and this is awesome. 526 00:26:45,108 --> 00:26:46,888 As you have slides here, 527 00:26:46,888 --> 00:26:49,355 can you talk a little bit about the technical status 528 00:26:49,355 --> 00:26:51,260 that currently you have data 529 00:26:51,260 --> 00:26:57,022 or information flow from and to Wikidata and PanLex. 530 00:26:57,022 --> 00:26:59,955 Is that currently implemented already 531 00:26:59,955 --> 00:27:03,888 and how do you deal with 532 00:27:03,888 --> 00:27:07,133 back and forth or even feedback loop information 533 00:27:07,140 --> 00:27:09,950 between PanLex and Wikidata? 534 00:27:09,950 --> 00:27:13,733 So we actually don't have any formal connections to Wikidata at this point, 535 00:27:13,733 --> 00:27:15,343 and this is something that I'm, again, 536 00:27:15,343 --> 00:27:17,824 I'm really excited to talk to people in this conference about. 537 00:27:17,824 --> 00:27:20,644 We've had some interaction with Wiktionary, 538 00:27:21,774 --> 00:27:24,720 but Wikidata is actually a better fit, honestly, 539 00:27:24,720 --> 00:27:26,755 for what we are looking for. 
540 00:27:27,355 --> 00:27:29,201 Having directly lexical stuff 541 00:27:29,201 --> 00:27:32,311 means that we have to do a lot less data analysis and extraction. 542 00:27:32,933 --> 00:27:37,148 And so the answer is, we don't yet, but we want to. 543 00:27:37,148 --> 00:27:39,800 (man 5) And if not, what are the obstacles? 544 00:27:39,800 --> 00:27:43,511 And as we can see, Wikidata already supports several languages, 545 00:27:43,511 --> 00:27:46,533 but when I look up *translate.panlex.org*, 546 00:27:46,533 --> 00:27:49,311 you apparently support many, many variants, 547 00:27:49,311 --> 00:27:50,888 much more than Wikidata. 548 00:27:50,888 --> 00:27:53,316 How do you see there is a gap 549 00:27:53,316 --> 00:27:57,177 between translation or lexical translation first, 550 00:27:57,177 --> 00:28:00,155 application versus an effort 551 00:28:00,155 --> 00:28:03,777 as trying to map a knowledge structure. 552 00:28:03,777 --> 00:28:05,866 Mapping knowledge will actually be very interesting. 553 00:28:05,866 --> 00:28:07,336 We've had some very interesting discussions 554 00:28:07,336 --> 00:28:12,311 about the way that Wikidata organizes their lexical data, 555 00:28:12,311 --> 00:28:13,777 your lexical data, 556 00:28:13,777 --> 00:28:16,044 and how we organize our lexical data. 557 00:28:16,044 --> 00:28:20,933 And there are subtle differences that would require a mapping strategy, 558 00:28:21,460 --> 00:28:24,577 some of which will not necessarily be automatic, 559 00:28:24,577 --> 00:28:27,422 but we might be able to develop techniques to be able to do this. 560 00:28:27,422 --> 00:28:30,796 You gave the example of language variants. 561 00:28:30,796 --> 00:28:34,111 We tend to be very "splittery" when it comes to language variants.
562 00:28:34,111 --> 00:28:36,311 In other words, if we get a source that says 563 00:28:36,311 --> 00:28:38,755 that this is the dialect spoken 564 00:28:38,755 --> 00:28:41,695 on the left side of the river in Papua New Guinea, for this language, 565 00:28:41,695 --> 00:28:42,913 and we get another source that says 566 00:28:42,913 --> 00:28:44,955 this is the dialect spoken on the right side of the river, 567 00:28:44,955 --> 00:28:46,720 then we consider them essentially separate languages. 568 00:28:46,720 --> 00:28:51,072 And so we do this in order to basically preserve the most data that we can. 569 00:28:52,222 --> 00:28:54,355 Being able to map that to how Wikidata does it-- 570 00:28:54,355 --> 00:28:56,938 Actually, what I would love is to have conversations 571 00:28:56,938 --> 00:29:00,696 about how languages 572 00:29:00,696 --> 00:29:06,323 are designated on Wikidata. 573 00:29:08,145 --> 00:29:12,320 Again, we go with the strategy of very much a "splittery" strategy. 574 00:29:13,856 --> 00:29:17,440 We broadly rely on ISO 639-3 codes, 575 00:29:17,866 --> 00:29:19,643 which is provided by the Ethnologue, 576 00:29:19,643 --> 00:29:23,840 and then each individual code, we then allow multiple variants within it, 577 00:29:23,840 --> 00:29:29,098 either for script variants or regional dialects or sociolects, etc. 578 00:29:30,240 --> 00:29:32,762 Again, opportunity for discussion and work. 579 00:29:35,622 --> 00:29:39,466 (woman 3) Hi, I would like to know if you have an OCR pipeline 580 00:29:39,466 --> 00:29:44,533 and especially because we've been trying to do OCR on Maya, 581 00:29:44,533 --> 00:29:47,928 and we don't get any results. 582 00:29:47,933 --> 00:29:49,933 It doesn't understand anything-- 583 00:29:49,933 --> 00:29:52,512 - Oh, yeah! (laughs) - (woman 3) And... yeah. 584 00:29:52,512 --> 00:29:56,078 So if your pipelines are available.
585 00:29:56,078 --> 00:30:00,288 And the other one is just on the overlap of ISO codes, 586 00:30:00,288 --> 00:30:01,641 like sometimes they say, 587 00:30:01,641 --> 00:30:04,199 "Oh, this is a language, and this is another language," 588 00:30:04,199 --> 00:30:06,555 but there are sources that say other stuff, 589 00:30:06,555 --> 00:30:10,133 as you were mentioning, but they tend to overlap. 590 00:30:10,133 --> 00:30:12,955 So how do you go on...? Yeah. 591 00:30:12,956 --> 00:30:15,155 Yeah, that's absolutely an amazing question. 592 00:30:15,155 --> 00:30:17,120 I really like it. 593 00:30:17,120 --> 00:30:20,400 So we don't have a formalized OCR pipeline per se; 594 00:30:20,400 --> 00:30:23,533 we do it on a sort of source by source basis. 595 00:30:23,533 --> 00:30:26,266 One of the reasons why is because we oftentimes have sources 596 00:30:26,266 --> 00:30:27,955 that not necessarily need to be OCR'd, 597 00:30:27,955 --> 00:30:29,841 that are available for some of these languages, 598 00:30:29,841 --> 00:30:32,766 and we concentrate on those because they require the least amount of work. 599 00:30:32,766 --> 00:30:35,000 But, obviously, if we really want to dive deep 600 00:30:35,000 --> 00:30:37,056 into some of our sources that are in our backlog, 601 00:30:37,056 --> 00:30:40,896 we're going to need to essentially develop strong OCR pipelines. 602 00:30:40,896 --> 00:30:43,968 But there's another aspect too, which is that, as you mentioned... 603 00:30:44,400 --> 00:30:48,576 like the people who designed OCR engines 604 00:30:49,088 --> 00:30:52,672 I think are not realizing how much you can stress test them. 605 00:30:52,672 --> 00:30:55,181 Like, you know what's fun?-- 606 00:30:55,181 --> 00:30:57,690 trying to OCR a Russian-Tibetan dictionary. 607 00:30:58,600 --> 00:31:00,216 It's really hard, it turns out... 
608 00:31:01,503 --> 00:31:03,747 We gave up, and we hired someone to just type it up, 609 00:31:04,022 --> 00:31:05,641 which was totally doable. 610 00:31:05,641 --> 00:31:07,260 And actually, it turns out 611 00:31:07,260 --> 00:31:10,266 that this amazing Russian woman learned to read Tibetan 612 00:31:10,266 --> 00:31:12,755 so she could type this up, which was super cool. 613 00:31:15,333 --> 00:31:18,270 I think that if you're dealing with stuff in the Latin scripts, 614 00:31:18,270 --> 00:31:22,871 then I think that OCR solutions can be developed, that are more robust, 615 00:31:22,871 --> 00:31:24,673 that deal with multilingual sources like this 616 00:31:24,673 --> 00:31:26,991 and expect that you're going to get a random four in there, 617 00:31:26,991 --> 00:31:28,284 if you're dealing with something like 618 00:31:28,284 --> 00:31:30,560 16th-century Mayan sources, you know, with digit four. 619 00:31:32,088 --> 00:31:37,600 But there are some sources 620 00:31:37,600 --> 00:31:40,111 that OCR is probably just never really going to catch up to, 621 00:31:40,111 --> 00:31:42,244 or require such an immense amount of work, 622 00:31:43,200 --> 00:31:46,933 that actually we put a little bit of this to use right now. 623 00:31:46,933 --> 00:31:48,800 We have another project we're running at PanLex 624 00:31:48,800 --> 00:31:53,533 to transcribe all of the traditional literature of Bali, 625 00:31:53,533 --> 00:31:57,952 and we found that in handwritten Balinese manuscripts, 626 00:31:58,444 --> 00:31:59,644 there's just no chance of OCR. 627 00:31:59,644 --> 00:32:02,200 So we got a bunch of Balinese people to type them up, 628 00:32:02,200 --> 00:32:05,000 and it's become a really cool cultural project within Bali, 629 00:32:05,000 --> 00:32:07,288 and it's become news and stuff like that. 
630 00:32:07,288 --> 00:32:09,084 So I would say 631 00:32:09,084 --> 00:32:11,377 that you don't necessarily need to rely on OCR, 632 00:32:11,377 --> 00:32:12,577 but there is a lot out there. 633 00:32:12,577 --> 00:32:15,160 So having good OCR solutions would be good. 634 00:32:16,663 --> 00:32:20,992 Also, if anyone out here is into super multilingual OCR, 635 00:32:20,992 --> 00:32:22,635 please come talk to me. 636 00:32:29,517 --> 00:32:31,377 (man 6) Thank you for your presentation. 637 00:32:32,007 --> 00:32:34,866 You talked about integration 638 00:32:34,866 --> 00:32:37,060 between PanLex and Wikidata, 639 00:32:37,060 --> 00:32:38,792 but you haven't gone into the specifics. 640 00:32:38,792 --> 00:32:42,701 So I was checking your data license, and it is under CC0. 641 00:32:42,701 --> 00:32:44,210 - Yes. - (man 6) That's really great. 642 00:32:44,210 --> 00:32:46,377 So there are two possible ways 643 00:32:46,377 --> 00:32:49,400 that either we can import the data 644 00:32:49,400 --> 00:32:52,777 or we can continue something similar to the Freebase way, 645 00:32:52,777 --> 00:32:55,688 where we had the complete database from the Freebase, 646 00:32:55,688 --> 00:32:59,080 and we imported them, and we made a link, 647 00:32:59,080 --> 00:33:03,955 an external identifier to the Freebase database. 648 00:33:03,955 --> 00:33:08,397 So if you have something in mind, are you thinking similar? 649 00:33:08,397 --> 00:33:10,401 Or you just want to make... 650 00:33:15,291 --> 00:33:18,755 an independent database which can be linked to Wikidata? 
651 00:33:18,755 --> 00:33:20,533 Yeah, so this is a great question 652 00:33:20,533 --> 00:33:23,282 and actually I feel like it's about one step ahead 653 00:33:23,282 --> 00:33:25,648 of some of the stuff that I've already been thinking about, 654 00:33:25,648 --> 00:33:29,555 partially because, like I said, 655 00:33:29,955 --> 00:33:32,111 getting the two databases to work together 656 00:33:32,111 --> 00:33:33,533 is a step in and of itself. 657 00:33:33,533 --> 00:33:35,332 I think the first step that we can take 658 00:33:35,333 --> 00:33:37,622 is literally just pooling our skills together. 659 00:33:37,911 --> 00:33:40,246 We have a lot of experience dealing with stuff 660 00:33:40,246 --> 00:33:42,656 like classifications of properties of individual lexemes 661 00:33:42,656 --> 00:33:44,734 that I'd love to share. 662 00:33:45,864 --> 00:33:49,050 But being able to link the databases themselves would be wonderful. 663 00:33:49,050 --> 00:33:50,808 I'm 100% for that. 664 00:33:50,808 --> 00:33:54,066 I think it would be a little bit easier 665 00:33:54,066 --> 00:33:56,022 on the Wikidata towards PanLex way, 666 00:33:56,022 --> 00:33:58,866 but maybe I'm just biased because I can see how that could work. 667 00:34:02,040 --> 00:34:06,088 Yeah, essentially, as long as Wikidata is comfortable 668 00:34:06,088 --> 00:34:09,620 with all the licensing stuff like that, or we work something out, 669 00:34:09,620 --> 00:34:12,057 then I think that would be a great idea. 670 00:34:13,216 --> 00:34:16,235 We'd just have to figure out ways of linking the data itself. 671 00:34:16,235 --> 00:34:22,234 One thing I can imagine is, essentially, that I would love for edits to Wikidata 672 00:34:22,577 --> 00:34:26,088 to immediately become populated to the PanLex database, 673 00:34:26,088 --> 00:34:28,551 without having to essentially 674 00:34:28,551 --> 00:34:30,786 just reingest it every...
675 00:34:30,786 --> 00:34:35,779 essentially making Wikidata a crowdsourceable interface to PanLex 676 00:34:35,779 --> 00:34:36,888 would be really awesome. 677 00:34:36,888 --> 00:34:39,777 And then being able to use PanLex in immediate translations, 678 00:34:39,780 --> 00:34:42,224 to be able to do translations across Wikidata lexical items-- 679 00:34:42,224 --> 00:34:43,770 that would be glorious. 680 00:34:55,288 --> 00:35:00,266 (man 7) This is like the auditing process of this semantic web 681 00:35:00,266 --> 00:35:03,808 to close holes by inference. 682 00:35:05,682 --> 00:35:09,733 If we think this further, this kind of translation, 683 00:35:09,733 --> 00:35:13,353 how do you deal with semantic mismatch 684 00:35:13,355 --> 00:35:16,088 and grammatical mismatch? 685 00:35:16,088 --> 00:35:18,888 For instance, if you try to translate something in German, 686 00:35:18,888 --> 00:35:21,933 you can simply put several words together 687 00:35:21,933 --> 00:35:25,986 and reach something that's sensible, 688 00:35:25,986 --> 00:35:29,184 and on the other hand, I think I read sometimes 689 00:35:31,450 --> 00:35:38,450 not every language has the same granular system 690 00:35:38,450 --> 00:35:40,453 for colors, for instance. 691 00:35:41,577 --> 00:35:42,800 You said everything 692 00:35:42,800 --> 00:35:45,010 uses a different system for colors or are the same? 693 00:35:45,530 --> 00:35:48,377 (man 7) I remember maybe that it's just about evolution of language 694 00:35:48,377 --> 00:35:51,533 that they started out with black and white and then-- 695 00:35:51,533 --> 00:35:53,333 Yeah, the color hierarchy. 696 00:35:53,333 --> 00:35:54,492 Actually, the color hierarchy 697 00:35:54,492 --> 00:35:57,271 is a great way to illustrate how this works, right? 
698 00:35:57,977 --> 00:36:01,400 So, essentially, when you have a single pivot language-- 699 00:36:02,043 --> 00:36:04,822 it's really interesting when you read papers on machine translations 700 00:36:04,822 --> 00:36:08,000 because oftentimes they'll talk about some hypothetical pivot language, 701 00:36:08,000 --> 00:36:09,826 that they say, "Oh yeah, there is a pivot language," 702 00:36:09,826 --> 00:36:12,133 and then you read in the paper and say, "It's English." 703 00:36:12,133 --> 00:36:16,688 And so what this form of lexical translation does, 704 00:36:16,680 --> 00:36:20,352 by passing it through many different intermediate languages, 705 00:36:20,755 --> 00:36:26,142 it has the effect of being able to deal with a lot of semantic ambiguity. 706 00:36:26,142 --> 00:36:28,426 Because as long as you're passing it through languages 707 00:36:28,426 --> 00:36:33,408 that contain the same reasonably similar semantic boundaries to a word, 708 00:36:33,408 --> 00:36:37,038 then you can avoid the problem of essentially 709 00:36:37,038 --> 00:36:39,808 introducing semantic ambiguity through the pivot language. 710 00:36:39,808 --> 00:36:43,266 So using the color hierarchy thing as an example, 711 00:36:43,266 --> 00:36:46,460 if you take a language that has a single color word for green and blue 712 00:36:46,460 --> 00:36:50,688 and it translates it into blue 713 00:36:50,688 --> 00:36:53,244 in your single pivot language 714 00:36:53,244 --> 00:36:54,477 and then into another language 715 00:36:54,477 --> 00:36:57,422 that has different ambiguities on these things, 716 00:36:57,422 --> 00:37:00,283 then you end up introducing semantic ambiguity. 
717 00:37:00,283 --> 00:37:02,370 But if you pass it through a bunch of other languages 718 00:37:02,370 --> 00:37:05,660 that also contain a single lexical item for green and blue, 719 00:37:05,660 --> 00:37:10,666 then, essentially, that semantic specificity 720 00:37:11,040 --> 00:37:16,990 gets passed along to the resultant language. 721 00:37:17,755 --> 00:37:20,666 As far as the grammatical feature aspects, 722 00:37:20,666 --> 00:37:23,488 PanLex has been primarily, in its history, 723 00:37:23,488 --> 00:37:28,960 collecting essentially lexemes, essentially lexical forms. 724 00:37:29,711 --> 00:37:31,800 And, by that, I mean, essentially, 725 00:37:31,804 --> 00:37:33,840 whatever you get as the headword for a dictionary. 726 00:37:34,800 --> 00:37:38,170 So we don't necessarily concentrate at this time 727 00:37:38,555 --> 00:37:40,955 on collecting grammatical variant forms, 728 00:37:40,955 --> 00:37:43,360 things like [inaudible] data, etc. 729 00:37:43,360 --> 00:37:44,830 or past tense and present tense. 730 00:37:44,830 --> 00:37:46,487 But it's something we're looking into. 731 00:37:46,488 --> 00:37:48,420 One thing that it's always important to remember 732 00:37:48,420 --> 00:37:50,600 is that because our focus is-- 733 00:37:51,422 --> 00:37:54,490 is on underserved and endangered minority languages, 734 00:37:55,000 --> 00:37:57,777 we want to make sure that something is available 735 00:37:57,777 --> 00:37:59,711 before we make it perfect. 736 00:38:01,621 --> 00:38:02,844 A phrase I absolutely love 737 00:38:02,844 --> 00:38:04,927 is "Don't let the perfect be the enemy of the good," 738 00:38:04,927 --> 00:38:06,570 and that's what we intend to do. 
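[Editor's sketch, not part of the talk] The green/blue example earlier in this answer can be made concrete with a toy translation graph. Everything here is an assumption for illustration: the language codes, the words, and above all the scoring, which simply counts how many intermediate expressions connect the source word to each candidate. The talk does not describe how PanLex actually scores translations.

```python
from collections import defaultdict

# Hypothetical dictionary entries as (language, word) pairs that translate each other.
# "aaa", "bbb", "ccc" and "ddd" each have one word covering both green and blue;
# English does not, so the aaa-eng dictionary had to pick "blue".
PAIRS = [
    (("aaa", "grue"), ("eng", "blue")),
    (("aaa", "grue"), ("bbb", "gru")),
    (("aaa", "grue"), ("ccc", "grao")),
    (("eng", "blue"), ("ddd", "blu")),   # ddd word for blue only
    (("bbb", "gru"), ("ddd", "garu")),   # ddd word for green-or-blue
    (("ccc", "grao"), ("ddd", "garu")),
]


def build_graph(pairs):
    """Build an undirected graph of attested translations."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def translate(graph, source, target_lang, pivots=None):
    """Rank target-language words by the number of intermediate expressions
    linking them to the source. If pivots is given, only those languages
    are allowed as intermediates."""
    scores = defaultdict(int)
    for mid in graph.get(source, ()):
        mid_lang = mid[0]
        if mid_lang == target_lang:
            continue  # direct translations are ignored in this sketch
        if pivots is not None and mid_lang not in pivots:
            continue
        for lang, word in graph.get(mid, ()):
            if lang == target_lang:
                scores[word] += 1
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


GRAPH = build_graph(PAIRS)

if __name__ == "__main__":
    # Single pivot: the English word narrows the meaning to blue.
    print(translate(GRAPH, ("aaa", "grue"), "ddd", pivots={"eng"}))
    # Many pivots: languages with the same green-or-blue word outvote English.
    print(translate(GRAPH, ("aaa", "grue"), "ddd"))
```

With English as the only pivot, the sketch returns the blue-only word; with all intermediates allowed, the green-or-blue word ranks first because two languages with matching semantic boundaries support it.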
739 00:38:06,570 --> 00:38:09,014 But we are super interested in the idea 740 00:38:09,014 --> 00:38:12,266 of being able to handle grammatical forms, 741 00:38:12,266 --> 00:38:14,031 and being able to translate across grammatical forms, 742 00:38:14,031 --> 00:38:15,665 and it's some stuff we've done some research on 743 00:38:15,665 --> 00:38:17,468 but we haven't fully implemented yet. 744 00:38:25,350 --> 00:38:28,777 (man 8) So, of the 7,500 or so languages, 745 00:38:30,448 --> 00:38:33,111 I assume you're relying on dictionaries which are written for us, 746 00:38:33,111 --> 00:38:36,222 but do all those languages have standard written forms 747 00:38:36,222 --> 00:38:38,101 and how do you deal with...? 748 00:38:38,101 --> 00:38:39,887 That's a great question. 749 00:38:42,111 --> 00:38:45,062 Essentially, yes, a lot of these languages 750 00:38:45,066 --> 00:38:47,977 as everyone's aware, are unwritten. 751 00:38:47,977 --> 00:38:50,666 However, any language for which a dictionary has been produced 752 00:38:50,666 --> 00:38:52,466 has some kind of orthography, 753 00:38:52,466 --> 00:38:56,710 and we rely on the orthography produced for the dictionary. 754 00:38:56,710 --> 00:38:59,686 We occasionally do some slight massaging of orthography 755 00:39:00,956 --> 00:39:03,177 if we can guarantee it to be lossless, basically. 756 00:39:03,177 --> 00:39:05,377 But we tend to avoid it as much as possible. 757 00:39:07,533 --> 00:39:11,485 So, essentially, we don't get into the business 758 00:39:11,485 --> 00:39:13,229 of developing orthographies for languages, 759 00:39:13,229 --> 00:39:14,967 because oftentimes they haven't developed, 760 00:39:14,967 --> 00:39:17,240 even if they're not really widely published. 
761 00:39:17,240 --> 00:39:22,155 So, for example, 762 00:39:22,155 --> 00:39:26,022 for a lot of languages that are spoken in New Guinea, 763 00:39:26,488 --> 00:39:29,125 there may not be a commonly used orthographic form, 764 00:39:29,125 --> 00:39:30,980 but some linguists just come up with something 765 00:39:30,980 --> 00:39:32,333 and that's a good first step. 766 00:39:33,473 --> 00:39:36,730 We also collect phonetic forms when they're available in dictionaries, 767 00:39:36,730 --> 00:39:38,400 and so that's another way in, 768 00:39:38,400 --> 00:39:40,533 essentially an IPA representation of the word, 769 00:39:40,533 --> 00:39:41,800 if that's available. 770 00:39:41,800 --> 00:39:43,333 So that can also be used as well. 771 00:39:43,333 --> 00:39:45,755 But we don't just typically use that as a pivot 772 00:39:45,755 --> 00:39:48,226 because it introduces certain ambiguities. 773 00:39:52,666 --> 00:39:55,466 (woman 4) Thank you, this might be a super silly question, 774 00:39:56,044 --> 00:40:00,572 but are those only the intermediate languages you work with? 775 00:40:00,572 --> 00:40:02,215 Oh, no. Oh, no. 776 00:40:02,222 --> 00:40:03,790 (woman 4) Oh, yes, alright. Thank you. 777 00:40:03,790 --> 00:40:05,683 No, I'm glad you asked. It answers the question. 778 00:40:05,683 --> 00:40:11,311 So this is actually a screenshot snap from *translate.panlex.org*. 779 00:40:11,311 --> 00:40:12,826 If you do a translation, 780 00:40:12,826 --> 00:40:15,022 you'll get a list of translations on the right side. 781 00:40:15,022 --> 00:40:17,874 You click a little *dot dot dot* button, you'll get a graph like this. 
782 00:40:17,874 --> 00:40:21,760 And what this shows is the intermediate languages, 783 00:40:22,010 --> 00:40:24,133 the top 20 by score-- 784 00:40:24,133 --> 00:40:26,093 I could go into the details of how we do the score 785 00:40:26,093 --> 00:40:27,452 but it's not super important now-- 786 00:40:27,452 --> 00:40:30,244 by score that are being used. 787 00:40:30,244 --> 00:40:33,393 But to make the translation, we're actually using way more than 20. 788 00:40:33,393 --> 00:40:35,797 The reason I cap it at 20 is because if you have more than 20-- 789 00:40:35,797 --> 00:40:37,661 like this is actually a kind of a physics simulation 790 00:40:37,661 --> 00:40:39,638 you can move the things around and they squiggle. 791 00:40:39,638 --> 00:40:42,200 If you have more than 20, your computer gets really mad. 792 00:40:45,400 --> 00:40:47,419 So it's more of a demonstration, yeah. 793 00:40:55,955 --> 00:40:57,888 (woman 5) Leila, from Wikimedia Foundation. 794 00:40:57,888 --> 00:41:00,155 Just one note on-- 795 00:41:00,155 --> 00:41:03,260 You mentioned Wikimedia Foundation a couple of times in your presentation, 796 00:41:03,260 --> 00:41:06,533 I wanted to say if you want to do any kind of data ingestion 797 00:41:06,533 --> 00:41:08,460 or a collaboration with Wikidata, 798 00:41:08,820 --> 00:41:11,200 perhaps Wikimedia Deutschland would be a better place 799 00:41:11,200 --> 00:41:13,182 to have these conversations with? 800 00:41:13,182 --> 00:41:16,256 Because Wikidata lives within Wikimedia Deutschland 801 00:41:16,256 --> 00:41:17,511 and the team is there, 802 00:41:17,511 --> 00:41:19,971 and also the community of volunteers around Wikidata 803 00:41:19,977 --> 00:41:23,710 would be the perfect place to talk 804 00:41:23,710 --> 00:41:25,590 about any kind of ingestions 805 00:41:25,590 --> 00:41:31,136 or working with bringing PanLex closer to Wikidata. 
806 00:41:31,577 --> 00:41:32,688 Great, thank you very much, 807 00:41:32,688 --> 00:41:34,901 because honestly I'm not exactly super familiar 808 00:41:34,901 --> 00:41:37,823 with all of the intricacies of the architecture 809 00:41:37,823 --> 00:41:39,740 of how all the projects relate to each other. 810 00:41:39,740 --> 00:41:41,977 I'm guessing by the laughs that it's complicated. 811 00:41:41,977 --> 00:41:44,333 But, yeah, so basically we would want to talk 812 00:41:44,333 --> 00:41:48,333 with whoever is responsible for Wikidata. 813 00:41:48,333 --> 00:41:52,120 So just do a little [inaudible] place thing, 814 00:41:52,860 --> 00:41:56,470 whoever is responsible for Wikidata, that's who we're interested in talking to, 815 00:41:56,470 --> 00:41:58,264 which is all of you volunteers. 816 00:42:03,266 --> 00:42:05,044 Any further questions? 817 00:42:10,066 --> 00:42:14,400 Okay, well, if anyone does end up having any further questions beyond this 818 00:42:14,400 --> 00:42:17,711 or ones that I talked about-- the details and specifics about these things, 819 00:42:17,711 --> 00:42:19,800 please come and talk to me, I'm super interested. 820 00:42:19,800 --> 00:42:23,977 And especially if you're dealing with anything involving lexical stuff, 821 00:42:23,977 --> 00:42:28,666 anything involving endangered minority languages 822 00:42:28,666 --> 00:42:30,444 and underserved languages, 823 00:42:30,444 --> 00:42:34,410 and also Unicode, which is something I do as well. 824 00:42:36,220 --> 00:42:37,800 So thank you very much 825 00:42:37,800 --> 00:42:39,563 and thank you for inviting me to come speak, 826 00:42:39,563 --> 00:42:41,550 I'm hoping that you enjoyed all this. 827 00:42:41,550 --> 00:42:43,753 (applause)