Privacy in the Web of Data
Mathieu d'Aquin - @mdaquin
Knowledge Media Institute, The Open University
What is Privacy?
No, really, I'm asking you...
Let's ask the Big Web Intelligence...
Or the small web intelligence(S)
Do you get the same thing as I do in your browser?
In your own language?
Let's do some ontological thinking...
Assuming Privacy is something an agent experiences, what can be the subject of privacy?
What are the types of agents that can experience it?
Let's do some ontological thinking...
Assuming Privacy is something that applies to something related to the agent, what can be the object of privacy?
What are the types of things on which privacy applies?
Let's do some ontological thinking...
Assuming Privacy is something that the subject applies to the object...
What are the actions that characterise privacy?
Let's ask the experts
Understanding Privacy by Daniel Solove starts with a long and detailed review of the many definitions of privacy found in various types of literature (research, regulation, policies, etc.)
My favorite (so far)
Privacy is not simply an absence of
information about us in the minds of others,
rather it is the control we have over
information about ourselves
-- Charles Fried - 1968
Let's ask the experts
The conclusions are that, first, trying to define privacy in itself won't work...
Let's ask the experts
... and second, a taxonomy of privacy threats
Let's ask the law then...
EU Directive 2002/58 on Privacy and Electronic Communications...
relies on the Data Protection Directive
"right to privacy in the electronic communication sector"
Let's ask the law then...
Directive 2002/58 main points:
- Data retention: for no longer than necessary
- Spam: unsolicited emails should not be sent without prior consent
- Cookies: need opt-in when not strictly necessary
Let's ask the law then...
EU Directive 95/46/EC on Data Protection
One has the right to "private and family life, his home and his correspondence"
Let's ask the law then...
Directive 95/46/EC definition of Personal Data
"any information relating to an identified or identifiable natural person ("data subject"); an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identification number or to one or more factors specific to his physical, physiological, mental, economic, cultural or social identity"
Let's ask the law then...
Directive 95/46/EC main points:
- Transparency: Right to be informed of data collection and processing. Processing should happen with consent, and only for legitimate purposes.
- Amounts: Personal data should be collected and processed only in proportion to the explicit purpose.
- Transfer: Personal data can only be transferred to countries outside the EU if these countries possess adequate levels of protection
Privacy when publishing Web Data?
two approaches - control or anonymity
Access control
For web data, basic techniques exist:
- The Virtuoso triple store implements access control lists at the level of data graphs
- Work from Luca Costabello et al. on the Shi3ld system for contextual access to web data in RDF
- Work from Sabrina Kirrane on the secure manipulation of linked data (see ISWC 2013 paper)
But it can get much more complex than this...
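To make the graph-level idea from the first bullet concrete, here is a minimal sketch in Python. It is not Virtuoso's API or the Shi3ld model: the quads, graph names, agents and the `visible_triples` helper are all invented for illustration. Each triple lives in a named graph, and the access control list says who may read each graph.

```python
# Quads: (subject, predicate, object, named graph). All names are made up.
QUADS = [
    ("ex:alice", "foaf:name", "Alice", "ex:public"),
    ("ex:alice", "ex:salary", "52000", "ex:hr"),
    ("ex:bob", "foaf:name", "Bob", "ex:public"),
    ("ex:bob", "ex:diagnosis", "flu", "ex:medical"),
]

# Access control list: named graph -> agents allowed to read it.
# "*" is a hypothetical wildcard meaning "anyone".
ACL = {
    "ex:public": {"*"},
    "ex:hr": {"ex:hr-manager"},
    "ex:medical": {"ex:doctor"},
}


def readable_graphs(agent):
    """Named graphs the agent is allowed to read."""
    return {g for g, agents in ACL.items() if "*" in agents or agent in agents}


def visible_triples(agent):
    """Triples the agent can see: only those stored in graphs it may read."""
    allowed = readable_graphs(agent)
    return [(s, p, o) for s, p, o, g in QUADS if g in allowed]


print(visible_triples("ex:doctor"))
print(visible_triples("ex:stranger"))
```

The granularity is the whole graph: either an agent sees every triple in `ex:medical` or none of them. Finer-grained or context-dependent rules are where it gets more complex.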
Anonymisation
Or is it de-identification, pseudonymisation, ...?
Anonymisation
De-identification
Pseudonymisation
A concrete (bad) example
NYC Taxi Open Data - http://www.andresmh.com/nyctaxitrips/
medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude
89D227B655E5C82AECF13C3F540D4CF4,BA96DE419E711691B9445D6A6307C170,CMT,1,N,2013-01-01 15:11:48,2013-01-01 15:18:10,4,382,1.00,-73.978165,40.757977,-73.989838,40.751171
0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,1,N,2013-01-06 00:18:35,2013-01-06 00:22:54,1,259,1.50,-74.006683,40.731781,-73.994499,40.75066
0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,1,N,2013-01-05 18:49:41,2013-01-05 18:54:23,1,282,1.10,-74.004707,40.73777,-74.009834,40.726002
DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,1,N,2013-01-07 23:54:15,2013-01-07 23:58:20,2,244,.70,-73.974602,40.759945,-73.984734,40.759388
DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,1,N,2013-01-07 23:25:03,2013-01-07 23:34:24,1,560,2.10,-73.97625,40.748528,-74.002586,40.747868
20D9ECB2CA0767CF7A01564DF2844A3E,598CCE5B9C1918568DEE71F43CF26CD2,CMT,1,N,2013-01-07 15:27:48,2013-01-07 15:38:37,1,648,1.70,-73.966743,40.764252,-73.983322,40.743763
496644932DF3932605C22C7926FF0FE0,513189AD756FF14FE670D10B92FAF04C,CMT,1,N,2013-01-08 11:01:15,2013-01-08 11:08:14,1,418,.80,-73.995804,40.743977,-74.007416,40.744343
0B57B9633A2FECD3D3B1944AFC7471CF,CCD4367B417ED6634D986F573A552A62,CMT,1,N,2013-01-07 12:39:18,2013-01-07 13:10:56,3,1898,10.70,-73.989937,40.756775,-73.86525,40.77063
A concrete (bad) example
This is pseudonymisation... badly done. The medallion, which identifies taxis and is a structured code, has been replaced by an MD5 hash. Brute-force re-identification is easy, e.g.
medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude
9Y99,5296319,VTS,1,,2013-12-06 00:07:00,2013-12-06 00:16:00,5,540,1.85,-73.97953,40.776447,-73.982254,40.754925
9Y99,5296319,VTS,1,,2013-12-06 00:20:00,2013-12-06 00:46:00,5,1560,6.58,-73.985779,40.757317,-73.984543,40.681244
DIP1,111333,VTS,1,,2013-12-03 12:10:00,2013-12-03 12:24:00,5,840,.00,0,0,0,0
SBV106,429925,VTS,1,,2013-12-05 23:04:00,2013-12-05 23:16:00,6,720,2.86,-73.988197,40.731232,-73.96199,40.764343
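Why is brute force so easy? A rough sketch in Python, under one loud assumption: that some medallions follow the pattern digit, letter, digit, digit (like 9Y99 above). The real set of formats is not given here, so `build_lookup` and its pattern are illustrative only. The point is that the space of possible codes is tiny, so every hash can be precomputed.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def md5_hex(value):
    """Upper-case hex MD5 digest, the form used in the released file."""
    return hashlib.md5(value.encode("ascii")).hexdigest().upper()


def build_lookup():
    """Map hash -> medallion for every code of the ASSUMED form digit-letter-digit-digit."""
    table = {}
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        medallion = f"{d1}{letter}{d2}{d3}"
        table[md5_hex(medallion)] = medallion
    return table


lookup = build_lookup()
print(len(lookup))  # 10 * 26 * 10 * 10 = 26000 candidates

# Simulate a pseudonymised value, then reverse it with the table.
pseudonym = md5_hex("9Y99")
print(lookup[pseudonym])  # 9Y99
```

Hashing only hides a value when the set of possible inputs is too large to enumerate. Codes of other shapes (e.g. SBV106) would need their own pattern added to the enumeration, but the spaces stay small.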
Another (better) example
"Anonymised" UK MOT result data: http://data.gov.uk/dataset/anonymised_mot_test
24504876|4719809|2013-03-04|4|N|P|24703|NE|LOTUS|EXIGE S TOUR+|ORANGE|P|1796|2009-04-02
24560396|228032|2013-01-18|4|N|P|22925|EH|LOTUS|ESPRIT V8 TURBO|RED|P|3506|2002-02-16
24704328|4748023|2013-03-26|4|PR|P|13546|CW|LOTUS|EXIGE S PERF & TOUR&SPORT|BLUE|P|1796|2009-01-02
24704329|4748023|2013-03-26|4|N|F|13546|CW|LOTUS|EXIGE S PERF & TOUR&SPORT|BLUE|P|1796|2009-01-02
24711321|133287|2013-07-12|4|N|P|842|WN|LOTUS|ELAN|RED|P|1558|1972-06-30
24711911|4752029|2013-06-20|4|N|P|71650|NR|LOTUS|ELITE|SILVER|P|1973|1978-08-01
24716823|4753115|2013-08-02|4|N|P|83213|IP|LOTUS|EUROPA SPECIAL|YELLOW|P|1558|1971-01-01
24718233|4753451|2013-03-08|4|N|P|6799|NG|LOTUS|ELAN S4|YELLOW|P|1558|1969-09-01
25150458|4822367|2013-05-13|4|N|P|30619|SR|LOTUS|ELISE III|GREY|P|1796|2002-10-28
25391989|4869571|2013-08-02|4|N|PRS|38587|CH|LOTUS|ECLAT|BLUE|P|2174|1982-09-09
25392002|4869573|2013-03-30|4|N|P|41814|WA|LOTUS|ESPRIT TURBO|RED|P|2174|1982-10-05
25393808|4869940|2013-07-16|4|N|P|61980|SE|LOTUS|ELAN|RED|P|1558|1966-05-11
Another (better) example
Another example of de-identification/pseudonymisation, much better than the previous one... But is it really anonymous?
Say I have a friend with a cream Lotus Elan. He wants to sell it to me, but won't tell me what year it is from or whether it had problems at its last MOT.
grep '|LOTUS|ELAN.*|CREAM' test_result_2013.txt
145478650|25464350|2013-02-14|4|PR|P|85914|NG|LOTUS|ELAN +2S|CREAM|P|1558|1969-08-12
145478651|25464350|2013-02-11|4|N|F|85914|NG|LOTUS|ELAN +2S|CREAM|P|1558|1969-08-12
Assessing the anonymity of data - k-anonymity
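A dataset is k-anonymous if every record shares its values on the identifying attributes (the quasi-identifiers) with at least k-1 other records, i.e. every combination of values that occurs is shared by at least k records.
In the examples that follow, k is counted over make, model, colour and year; the first column (test result, P or F) is the sensitive value one wants to protect.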
Example - k=1
| P | LOTUS | EXIGE | ORANGE | 2009 |
| P | LOTUS | ESPRIT | RED | 2002 |
| P | LOTUS | ELAN | BLUE | 2009 |
| P | LOTUS | EXIGE | BLUE | 2009 |
| F | LOTUS | ELAN | RED | 1972 |
| P | LOTUS | ELITE | SILVER | 1978 |
| P | LOTUS | EUROPA | YELLOW | 1971 |
| P | LOTUS | ELAN | RED | 1969 |
| P | LOTUS | ELISE | GREY | 2002 |
| P | LOTUS | ECLAT | BLUE | 1982 |
| P | LOTUS | ESPRIT | RED | 1982 |
| P | LOTUS | ELAN | RED | 1966 |
Example - k=3
| P | LOTUS | ELISE | RED | 2002 |
| P | LOTUS | ELISE | RED | 2002 |
| P | LOTUS | ELAN | RED | 2009 |
| P | LOTUS | ELISE | RED | 2009 |
| F | LOTUS | ELAN | RED | 2002 |
| P | LOTUS | ELISE | RED | 2009 |
| P | LOTUS | ELISE | RED | 2009 |
| P | LOTUS | ELAN | RED | 2009 |
| P | LOTUS | ELISE | RED | 2002 |
| P | LOTUS | ELAN | RED | 2002 |
| P | LOTUS | ELAN | RED | 2009 |
| P | LOTUS | ELISE | RED | 2002 |
| P | LOTUS | ELAN | RED | 2002 |
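Computing k is just grouping and counting. A minimal sketch in Python over the k=3 table above; the column names and the `k_anonymity` helper are my own labels, and treating the test result as the sensitive value (not a quasi-identifier) is an assumption about how the example is meant to be read.

```python
from collections import Counter

# Rows of the k=3 example above: result, make, model, colour, year.
TABLE = """
P LOTUS ELISE RED 2002
P LOTUS ELISE RED 2002
P LOTUS ELAN RED 2009
P LOTUS ELISE RED 2009
F LOTUS ELAN RED 2002
P LOTUS ELISE RED 2009
P LOTUS ELISE RED 2009
P LOTUS ELAN RED 2009
P LOTUS ELISE RED 2002
P LOTUS ELAN RED 2002
P LOTUS ELAN RED 2009
P LOTUS ELISE RED 2002
P LOTUS ELAN RED 2002
"""
COLUMNS = ["result", "make", "model", "colour", "year"]
records = [dict(zip(COLUMNS, line.split())) for line in TABLE.strip().splitlines()]


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[c] for c in quasi_identifiers) for r in records)
    return min(groups.values())


# Quasi-identifiers only (assumed: everything except the test result).
print(k_anonymity(records, ["make", "model", "colour", "year"]))  # 3

# Counting the test result as identifying too: the single F row stands alone.
print(k_anonymity(records, COLUMNS))  # 1
```

The second call shows how much the answer depends on which columns are considered identifying.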
Methods to achieve k-anonymity
- Obfuscation: Replace problematic values with a *
- Generalisation: Replace problematic values by more general values
Example
| P | LOTUS | EXIGE | ORANGE | 2009 |
| P | LOTUS | ESPRIT | RED | 2002 |
| P | LOTUS | ELAN | BLUE | 2009 |
| P | LOTUS | EXIGE | BLUE | 2009 |
| F | LOTUS | ELAN | RED | 1972 |
| P | LOTUS | ELITE | SILVER | 1978 |
| P | LOTUS | EUROPA | YELLOW | 1971 |
| P | LOTUS | ELAN | RED | 1969 |
| P | LOTUS | ELISE | GREY | 2002 |
| P | LOTUS | ECLAT | BLUE | 1982 |
| P | LOTUS | ESPRIT | RED | 1982 |
| P | LOTUS | ELAN | RED | 1966 |
Example obfuscation - k=3
| P | LOTUS | * | * | * |
| P | LOTUS | * | * | * |
| P | LOTUS | * | * | * |
| P | LOTUS | * | * | * |
| F | LOTUS | ELAN | RED | * |
| P | LOTUS | * | * | * |
| P | LOTUS | * | * | * |
| P | LOTUS | ELAN | RED | * |
| P | LOTUS | * | * | * |
| P | LOTUS | * | * | * |
| P | LOTUS | * | * | * |
| P | LOTUS | ELAN | RED | * |
Example generalisation - k=3
| P | LOTUS | E* | DARK | 2000s |
| P | LOTUS | E* | DARK | 2000s |
| P | LOTUS | E* | DARK | 2000s |
| P | LOTUS | E* | DARK | 2000s |
| F | LOTUS | ELAN | RED | * |
| P | LOTUS | E* | * | * |
| P | LOTUS | E* | * | * |
| P | LOTUS | ELAN | RED | * |
| P | LOTUS | E* | DARK | 2000s |
| P | LOTUS | E* | * | * |
| P | LOTUS | E* | * | * |
| P | LOTUS | ELAN | RED | * |
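Both methods can be seen as per-column rewriting functions, after which k is recomputed. A small sketch in Python on the original 12-row example; the helper names (`suppress`, `decade`, `era`, `transform`) are hypothetical, and applying one function to a whole column is a simplification of the row-by-row choices made in the tables above.

```python
from collections import Counter

# The original example rows: model, colour, year (make is always LOTUS).
ROWS = [
    ("EXIGE", "ORANGE", "2009"), ("ESPRIT", "RED", "2002"),
    ("ELAN", "BLUE", "2009"), ("EXIGE", "BLUE", "2009"),
    ("ELAN", "RED", "1972"), ("ELITE", "SILVER", "1978"),
    ("EUROPA", "YELLOW", "1971"), ("ELAN", "RED", "1969"),
    ("ELISE", "GREY", "2002"), ("ECLAT", "BLUE", "1982"),
    ("ESPRIT", "RED", "1982"), ("ELAN", "RED", "1966"),
]


def suppress(_value):
    """Obfuscation: hide the value completely."""
    return "*"


def decade(year):
    """Generalisation: 2009 -> 2000s, 1972 -> 1970s."""
    return year[:3] + "0s"


def era(year):
    """Coarser generalisation: only two buckets."""
    return "2000s" if year >= "2000" else "pre-2000"


def transform(rows, functions):
    """Apply one rewriting function per column to every row."""
    return [tuple(f(v) for f, v in zip(functions, row)) for row in rows]


def k_anonymity(rows):
    """Size of the smallest group of identical rows."""
    return min(Counter(rows).values())


print(k_anonymity(ROWS))                                          # 1
print(k_anonymity(transform(ROWS, [suppress, suppress, decade])))  # 2
print(k_anonymity(transform(ROWS, [suppress, suppress, era])))     # 5
```

The trade-off is visible in the last two lines: the coarser the generalisation, the higher k, and the less useful the data.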
Applying this to the web of data (graph)?
- Not records or rows, but nodes and entities
- No columns and fields, but properties and values
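One possible reading of the two points above, sketched in plain Python rather than SPARQL: an entity's "record" is the set of its (property, value) pairs, and k is the size of the smallest group of entities with the same set. The triples, prefixes and helper names are invented for illustration, and this is only one way to carry k-anonymity over to a graph.

```python
from collections import Counter, defaultdict

# A toy graph as (subject, property, value) triples. All names are made up.
TRIPLES = [
    ("ex:car1", "ex:make", "LOTUS"), ("ex:car1", "ex:model", "ELAN"),
    ("ex:car1", "ex:colour", "RED"),
    ("ex:car2", "ex:make", "LOTUS"), ("ex:car2", "ex:model", "ELAN"),
    ("ex:car2", "ex:colour", "RED"),
    ("ex:car3", "ex:make", "LOTUS"), ("ex:car3", "ex:model", "ELAN"),
    ("ex:car3", "ex:colour", "RED"),
    ("ex:car4", "ex:make", "LOTUS"), ("ex:car4", "ex:model", "ELISE"),
    ("ex:car4", "ex:colour", "GREY"),
]


def descriptions(triples):
    """Entity -> frozenset of its (property, value) pairs."""
    desc = defaultdict(set)
    for s, p, o in triples:
        desc[s].add((p, o))
    return {s: frozenset(pairs) for s, pairs in desc.items()}


def graph_k(triples):
    """Smallest number of entities sharing exactly the same description."""
    return min(Counter(descriptions(triples).values()).values())


def rare_pairs(triples, k):
    """(property, value) pairs carried by fewer than k entities."""
    support = Counter()
    for pairs in descriptions(triples).values():
        support.update(pairs)
    return sorted(pair for pair, n in support.items() if n < k)


print(graph_k(TRIPLES))        # 1
print(rare_pairs(TRIPLES, 3))  # [('ex:colour', 'GREY'), ('ex:model', 'ELISE')]
```

Unlike rows, entities need not have the same properties, and a property can have several values; using sets of pairs covers both cases, but links between entities are ignored here.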
Exercise: K-Anonymity with SPARQL
Same graph, de-identified: can you write a SPARQL query that calculates k and tells you which property values to obfuscate/generalise?
Is it possible to do the obfuscation directly in a SPARQL CONSTRUCT query?
Conclusion
There is much more to look at - differential privacy, l-diversity, etc.
(privacy could be a whole summer school in itself).
Conclusion
And we are only starting to see the new challenges that the web of data brings to privacy (data integration, lack of control, etc.)...
Imagine Bruce Wayne on Facebook posting pictures of his holidays with Poison Ivy...
Conclusion
see d'Aquin et al. @ Privon 2013
Conclusion
As the web of data grows, these opportunities and challenges get amplified.
Many research opportunities for web data, linked data and semantic web people!