Privacy in the Web of Data

Mathieu d'Aquin - @mdaquin

Knowledge Media Institute, The Open University

What is Privacy?

No, really, I'm asking you...

Let's ask the Big Web Intelligence...

Or the small web intelligence(S)


Do you get the same thing as I do in your browser?
In your own language?

Let's do some ontological thinking...


Assuming Privacy is something an agent experiences, what can be the subject of privacy?

What are the types of agents that can experience it?

Let's do some ontological thinking...


Assuming Privacy is something that applies to something related to the agent, what can be the object of privacy?

What are the types of things on which privacy applies?

Let's do some ontological thinking...


Assuming Privacy is something that the subject applies to the object?

What are the actions that characterise privacy?

Let's ask the experts


Understanding Privacy by Daniel Solove starts with a long and detailed review of the many definitions of privacy found in various types of literature (research, regulation, policies, etc.)

My favorite (so far)

Privacy is not simply an absence of information about us in the minds of others, rather it is the control we have over information about ourselves
-- Charles Fried - 1968

Let's ask the experts

The conclusions are that, first, trying to define privacy in itself won't work...

Let's ask the experts

... and second, a taxonomy of privacy threats

Let's ask the law then...


EU Directive 2002/58 on Privacy and Electronic Communications...
relies on the Data Protection Directive

"right to privacy in the electronic communication sector"

Let's ask the law then...


Directive 2002/58 main points:
  • Data retention: for no longer than necessary
  • Spam: unsolicited emails should not be sent
  • Cookies: need opt-in when not strictly necessary

Let's ask the law then...


EU Directive 95/46/EC on Data Protection

One has the right to "private and family life, his home and his correspondence"

Let's ask the law then...


Directive 95/46/EC definition of Personal Data

"any information relating to an identified or identifiable natural person ("data subject"); an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identification number or to one or more factors specific to his physical, physiological, mental, economic, cultural or social identity"

Let's ask the law then...


Directive 95/46/EC main points:
  • Transparency: Right to be informed of data collection and processing. Processing should happen with consent, and only for legitimate purposes.
  • Amounts: Personal data should be collected and processed only in proportions adequate to the explicit purpose.
  • Transfer: Personal data can only be transferred to countries outside the EU if these countries possess adequate levels of protection.

Privacy when publishing Web Data?

Two approaches: control or anonymity

Access control

For web data, basic techniques exist:
  • The Virtuoso triple store implements access control lists at the level of data graphs
  • Work from Luca Costabello et al. on the Shi3ld system for contextual access to web data in RDF
  • Work from Sabrina Kirrane on the Secure Manipulation of Linked Data (see ISWC 2013 paper)
But it can get much more complex than this...

Anonymisation

or is it de-identification, pseudonymisation, ??

Anonymisation

De-identification

Pseudonymisation

A concrete (bad) example

NYC Taxi Open Data - http://www.andresmh.com/nyctaxitrips/

medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude
89D227B655E5C82AECF13C3F540D4CF4,BA96DE419E711691B9445D6A6307C170,CMT,1,N,2013-01-01 15:11:48,2013-01-01 15:18:10,4,382,1.00,-73.978165,40.757977,-73.989838,40.751171
0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,1,N,2013-01-06 00:18:35,2013-01-06 00:22:54,1,259,1.50,-74.006683,40.731781,-73.994499,40.75066
0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,1,N,2013-01-05 18:49:41,2013-01-05 18:54:23,1,282,1.10,-74.004707,40.73777,-74.009834,40.726002
DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,1,N,2013-01-07 23:54:15,2013-01-07 23:58:20,2,244,.70,-73.974602,40.759945,-73.984734,40.759388
DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,1,N,2013-01-07 23:25:03,2013-01-07 23:34:24,1,560,2.10,-73.97625,40.748528,-74.002586,40.747868
20D9ECB2CA0767CF7A01564DF2844A3E,598CCE5B9C1918568DEE71F43CF26CD2,CMT,1,N,2013-01-07 15:27:48,2013-01-07 15:38:37,1,648,1.70,-73.966743,40.764252,-73.983322,40.743763
496644932DF3932605C22C7926FF0FE0,513189AD756FF14FE670D10B92FAF04C,CMT,1,N,2013-01-08 11:01:15,2013-01-08 11:08:14,1,418,.80,-73.995804,40.743977,-74.007416,40.744343
0B57B9633A2FECD3D3B1944AFC7471CF,CCD4367B417ED6634D986F573A552A62,CMT,1,N,2013-01-07 12:39:18,2013-01-07 13:10:56,3,1898,10.70,-73.989937,40.756775,-73.86525,40.77063

A concrete (bad) example

This is pseudonymisation... badly done. The medallion, which identifies taxis and is a structured code, has been replaced by an MD5 hash. Brute-force re-identification is easy, e.g.

medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs
9Y99,5296319,VTS,1,,2013-12-06 00:07:00,2013-12-06 00:16:00,5,540,1.85,-73.97953,40.776447,-73.982254,40.754925
9Y99,5296319,VTS,1,,2013-12-06 00:20:00,2013-12-06 00:46:00,5,1560,6.58,-73.985779,40.757317,-73.984543,40.681244
DIP1,111333,VTS,1,,2013-12-03 12:10:00,2013-12-03 12:24:00,5,840,.00,0,0,0,0
SBV106,429925,VTS,1,,2013-12-05 23:04:00,2013-12-05 23:16:00,6,720,2.86,-73.988197,40.731232,-73.96199,40.764343
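
Why brute force is cheap here can be illustrated with a minimal sketch. It assumes the medallion was hashed as a plain, unsalted ASCII string, and it only covers one assumed code shape (digit, letter, digit, digit, as in 9Y99 above). The function names are hypothetical.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits
from typing import Dict, Optional


def md5_hex(value: str) -> str:
    # The published file shows upper-case hex digests, so use the same form.
    return hashlib.md5(value.encode("ascii")).hexdigest().upper()


def build_lookup() -> Dict[str, str]:
    """Map digest -> medallion for every code of the ASSUMED shape
    digit-letter-digit-digit (10 * 26 * 10 * 10 = 26,000 candidates)."""
    table = {}
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        medallion = f"{d1}{letter}{d2}{d3}"
        table[md5_hex(medallion)] = medallion
    return table


LOOKUP = build_lookup()


def reidentify(digest: str) -> Optional[str]:
    # Returns the original medallion, or None if its shape was not enumerated.
    return LOOKUP.get(digest.upper())
```

Other shapes, such as DIP1 or SBV106 in the rows above, would need their own patterns added to the enumeration, but the candidate space stays small because the code is structured.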

Another (better) example


"Anonymised" UK MOT result data: http://data.gov.uk/dataset/anonymised_mot_test

24504876|4719809|2013-03-04|4|N|P|24703|NE|LOTUS|EXIGE S TOUR+|ORANGE|P|1796|2009-04-02
24560396|228032|2013-01-18|4|N|P|22925|EH|LOTUS|ESPRIT V8 TURBO|RED|P|3506|2002-02-16
24704328|4748023|2013-03-26|4|PR|P|13546|CW|LOTUS|EXIGE S PERF & TOUR&SPORT|BLUE|P|1796|2009-01-02
24704329|4748023|2013-03-26|4|N|F|13546|CW|LOTUS|EXIGE S PERF & TOUR&SPORT|BLUE|P|1796|2009-01-02
24711321|133287|2013-07-12|4|N|P|842|WN|LOTUS|ELAN|RED|P|1558|1972-06-30
24711911|4752029|2013-06-20|4|N|P|71650|NR|LOTUS|ELITE|SILVER|P|1973|1978-08-01
24716823|4753115|2013-08-02|4|N|P|83213|IP|LOTUS|EUROPA SPECIAL|YELLOW|P|1558|1971-01-01
24718233|4753451|2013-03-08|4|N|P|6799|NG|LOTUS|ELAN S4|YELLOW|P|1558|1969-09-01
25150458|4822367|2013-05-13|4|N|P|30619|SR|LOTUS|ELISE III|GREY|P|1796|2002-10-28
25391989|4869571|2013-08-02|4|N|PRS|38587|CH|LOTUS|ECLAT|BLUE|P|2174|1982-09-09
25392002|4869573|2013-03-30|4|N|P|41814|WA|LOTUS|ESPRIT TURBO|RED|P|2174|1982-10-05
25393808|4869940|2013-07-16|4|N|P|61980|SE|LOTUS|ELAN|RED|P|1558|1966-05-11

Another (better) example


Another example of de-identification/pseudonymisation, much better than the previous one... But is it really anonymous?
Say I have a friend with a cream Lotus Elan. He wants to sell it to me, but won't tell me its year or whether it had problems at its last MOT.

grep '|LOTUS|ELAN.*|CREAM' test_result_2013.txt
145478650|25464350|2013-02-14|4|PR|P|85914|NG|LOTUS|ELAN +2S|CREAM|P|1558|1969-08-12
145478651|25464350|2013-02-11|4|N|F|85914|NG|LOTUS|ELAN +2S|CREAM|P|1558|1969-08-12
				    

Assessing the anonymity of data - k-anonymity

A dataset is k-anonymous if every record shares its combination of (potentially identifying) values with at least k-1 other records, i.e. each combination appears in at least k records.

Example - k=1

P|LOTUS|EXIGE|ORANGE|2009
P|LOTUS|ESPRIT|RED|2002
P|LOTUS|ELAN|BLUE|2009
P|LOTUS|EXIGE|BLUE|2009
F|LOTUS|ELAN|RED|1972
P|LOTUS|ELITE|SILVER|1978
P|LOTUS|EUROPA|YELLOW|1971
P|LOTUS|ELAN|RED|1969
P|LOTUS|ELISE|GREY|2002
P|LOTUS|ECLAT|BLUE|1982
P|LOTUS|ESPRIT|RED|1982
P|LOTUS|ELAN|RED|1966

Example - k=3

P|LOTUS|ELISE|RED|2002
P|LOTUS|ELISE|RED|2002
P|LOTUS|ELAN|RED|2009
P|LOTUS|ELISE|RED|2009
F|LOTUS|ELAN|RED|2002
P|LOTUS|ELISE|RED|2009
P|LOTUS|ELISE|RED|2009
P|LOTUS|ELAN|RED|2009
P|LOTUS|ELISE|RED|2002
P|LOTUS|ELAN|RED|2002
P|LOTUS|ELAN|RED|2009
P|LOTUS|ELISE|RED|2002
P|LOTUS|ELAN|RED|2002
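
The k of the example above can be checked with a short sketch. It reads each row as (result, make, model, colour, year) and assumes that only model, colour and year count as identifying, with the test result treated as the sensitive value. That choice of columns is an assumption, not something the data states.

```python
from collections import Counter

# Rows of the k=3 example, read as (result, make, model, colour, year).
ROWS = [
    ("P", "LOTUS", "ELISE", "RED", 2002), ("P", "LOTUS", "ELISE", "RED", 2002),
    ("P", "LOTUS", "ELAN", "RED", 2009), ("P", "LOTUS", "ELISE", "RED", 2009),
    ("F", "LOTUS", "ELAN", "RED", 2002), ("P", "LOTUS", "ELISE", "RED", 2009),
    ("P", "LOTUS", "ELISE", "RED", 2009), ("P", "LOTUS", "ELAN", "RED", 2009),
    ("P", "LOTUS", "ELISE", "RED", 2002), ("P", "LOTUS", "ELAN", "RED", 2002),
    ("P", "LOTUS", "ELAN", "RED", 2009), ("P", "LOTUS", "ELISE", "RED", 2002),
    ("P", "LOTUS", "ELAN", "RED", 2002),
]

# ASSUMED identifying columns: model, colour, year.
QUASI_IDENTIFIERS = (2, 3, 4)


def k_anonymity(rows, columns):
    """Size of the smallest group of rows that agree on `columns`."""
    if not rows:
        return 0
    groups = Counter(tuple(row[c] for c in columns) for row in rows)
    return min(groups.values())


print(k_anonymity(ROWS, QUASI_IDENTIFIERS))  # 3
```

If the test result is counted as identifying too, the single failed Elan becomes unique and k drops to 1, which shows how much the answer depends on which columns are considered.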

Methods to achieve k-anonymity

  • Obfuscation: Replace problematic values with a *
  • Generalisation: Replace problematic values by more general values

Example

P|LOTUS|EXIGE|ORANGE|2009
P|LOTUS|ESPRIT|RED|2002
P|LOTUS|ELAN|BLUE|2009
P|LOTUS|EXIGE|BLUE|2009
F|LOTUS|ELAN|RED|1972
P|LOTUS|ELITE|SILVER|1978
P|LOTUS|EUROPA|YELLOW|1971
P|LOTUS|ELAN|RED|1969
P|LOTUS|ELISE|GREY|2002
P|LOTUS|ECLAT|BLUE|1982
P|LOTUS|ESPRIT|RED|1982
P|LOTUS|ELAN|RED|1966

Example obfuscation - k=3

P|LOTUS|*|*|*
P|LOTUS|*|RED|*
P|LOTUS|*|*|*
P|LOTUS|*|*|*
F|LOTUS|ELAN|RED|*
P|LOTUS|*|*|*
P|LOTUS|*|RED|*
P|LOTUS|ELAN|RED|*
P|LOTUS|*|*|*
P|LOTUS|*|*|*
P|LOTUS|*|RED|*
P|LOTUS|ELAN|RED|*
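
One simple way to automate obfuscation is a greedy pass that stars out one column at a time. The sketch below is only an illustration under assumptions: rows are reduced to (model, colour, year), the column order is chosen by hand, and the result will not necessarily match the table above cell for cell.

```python
from collections import Counter

# (model, colour, year) of the twelve example rows.
ROWS = [
    ("EXIGE", "ORANGE", 2009), ("ESPRIT", "RED", 2002), ("ELAN", "BLUE", 2009),
    ("EXIGE", "BLUE", 2009), ("ELAN", "RED", 1972), ("ELITE", "SILVER", 1978),
    ("EUROPA", "YELLOW", 1971), ("ELAN", "RED", 1969), ("ELISE", "GREY", 2002),
    ("ECLAT", "BLUE", 1982), ("ESPRIT", "RED", 1982), ("ELAN", "RED", 1966),
]


def suppress(rows, k, order):
    """Greedy cell suppression: for each column index in `order`, replace the
    value with "*" in every row whose current combination of values is shared
    by fewer than k rows. Group sizes are recounted before each column."""
    rows = [list(r) for r in rows]
    for col in order:
        counts = Counter(tuple(r) for r in rows)
        for r in rows:
            if counts[tuple(r)] < k:
                r[col] = "*"
    return [tuple(r) for r in rows]


# Hide the year first, then the model, then the colour (a hand-picked order).
for row in suppress(ROWS, k=3, order=(2, 0, 1)):
    print("|".join(str(v) for v in row))
```

The order matters: suppressing a different column first gives a different, possibly less informative, result. A greedy pass like this is also not guaranteed to reach k in general, since the fully starred rows may themselves number fewer than k.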

Example generalisation - k=3

P|LOTUS|E*|DARK|2000s
P|LOTUS|E*|DARK|2000s
P|LOTUS|E*|DARK|2000s
P|LOTUS|E*|DARK|2000s
F|LOTUS|ELAN|RED|*
P|LOTUS|E*|*|*
P|LOTUS|E*|*|*
P|LOTUS|ELAN|RED|*
P|LOTUS|E*|DARK|2000s
P|LOTUS|E*|DARK|*
P|LOTUS|E*|*|*
P|LOTUS|ELAN|RED|*
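
Generalisation replaces a value by its parent in some hierarchy. A minimal sketch of such hierarchies follows. The colour classes are hypothetical, chosen only to mirror the DARK label above, and the model and year rules (first letter, decade) are likewise assumptions.

```python
# Hypothetical colour hierarchy; the LIGHT class does not appear in the example.
COLOUR_CLASS = {
    "ORANGE": "DARK", "RED": "DARK", "BLUE": "DARK", "GREY": "DARK",
    "SILVER": "LIGHT", "YELLOW": "LIGHT", "CREAM": "LIGHT",
}


def generalise_model(model: str) -> str:
    # Keep only the first letter: EXIGE -> E*
    return model[0] + "*"


def generalise_colour(colour: str) -> str:
    # Unknown colours fall back to full obfuscation.
    return COLOUR_CLASS.get(colour, "*")


def generalise_year(year: int) -> str:
    # Round down to the decade: 2009 -> 2000s
    return f"{year // 10 * 10}s"


def generalise(row):
    """Generalise one (model, colour, year) row."""
    model, colour, year = row
    return (generalise_model(model), generalise_colour(colour), generalise_year(year))


print(generalise(("EXIGE", "ORANGE", 2009)))  # ('E*', 'DARK', '2000s')
```

In practice generalisation would be applied selectively, only to rows whose group is smaller than k, and combined with obfuscation where generalising is not enough.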

Applying this to the web of data (graph)?

  • No records or rows, but nodes and entities
  • No columns and fields, but properties and values

Exercise: De-identification with SPARQL

Given the graph http://data.open.ac.uk/context/people/kmifoaf on data.open.ac.uk (accessible through data.open.ac.uk/sparql),
create a de-identified version (removing identifying info, such as names and labels of people) using a CONSTRUCT query.
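
As a warm-up before writing the query, the same idea can be sketched in plain Python over a list of (subject, predicate, object) triples. This is an analogue, not a SPARQL answer. The set of identifying predicates and the example.org sample data are assumptions, and person URIs are replaced by blank-node-style labels because a URI can identify someone just as well as a name.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# ASSUMED list of identifying predicates; the real graph may use others.
IDENTIFYING = {FOAF + "name", FOAF + "mbox", RDFS + "label"}


def deidentify(triples):
    """Drop identifying triples about people and pseudonymise their URIs."""
    people = {s for s, p, o in triples if p == RDF_TYPE and o == FOAF + "Person"}
    pseudonym = {}

    def rename(term):
        if term in people:
            return pseudonym.setdefault(term, f"_:p{len(pseudonym)}")
        return term

    return [
        (rename(s), p, rename(o))
        for s, p, o in triples
        if not (s in people and p in IDENTIFYING)
    ]


# Hypothetical sample data, not taken from data.open.ac.uk.
SAMPLE = [
    ("http://example.org/p/alice", RDF_TYPE, FOAF + "Person"),
    ("http://example.org/p/alice", FOAF + "name", "Alice"),
    ("http://example.org/p/alice", FOAF + "knows", "http://example.org/p/bob"),
    ("http://example.org/p/bob", RDF_TYPE, FOAF + "Person"),
    ("http://example.org/p/bob", RDFS + "label", "Bob"),
]

for triple in deidentify(SAMPLE):
    print(triple)
```

A CONSTRUCT query would follow the same two steps: a pattern that matches everything except the identifying properties, and a template that emits new identifiers in place of the original person URIs.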

Exercise: K-Anonymity with SPARQL

Same graph, de-identified: can you write a SPARQL query that calculates k, and tells you which property-value pairs to obfuscate/generalise?
Is it possible to do the obfuscation directly in a SPARQL CONSTRUCT query?
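
One possible reading of k for a graph, sketched in plain Python rather than SPARQL: treat each node's set of (property, value) pairs as its record, and count how many nodes share the same set. This is an assumption about how to transfer the tabular definition, and the prefixed names in the sample are hypothetical.

```python
from collections import Counter, defaultdict


def describe(triples):
    """Map each subject to the frozen set of its (property, value) pairs."""
    desc = defaultdict(set)
    for s, p, o in triples:
        desc[s].add((p, o))
    return {s: frozenset(pairs) for s, pairs in desc.items()}


def graph_k(triples):
    """Size of the smallest group of subjects with identical descriptions."""
    groups = Counter(describe(triples).values())
    return min(groups.values()) if groups else 0


def rare_pairs(triples, k):
    """Property-value pairs used by fewer than k subjects: the first
    candidates for obfuscation or generalisation."""
    usage = Counter(pair for pairs in describe(triples).values() for pair in pairs)
    return {pair for pair, n in usage.items() if n < k}


# Hypothetical de-identified sample.
SAMPLE = [
    ("_:p0", "rdf:type", "foaf:Person"), ("_:p0", "ex:memberOf", "ex:kmi"),
    ("_:p0", "foaf:topic_interest", "ex:privacy"),
    ("_:p1", "rdf:type", "foaf:Person"), ("_:p1", "ex:memberOf", "ex:kmi"),
    ("_:p2", "rdf:type", "foaf:Person"), ("_:p2", "ex:memberOf", "ex:kmi"),
]

print(graph_k(SAMPLE))        # 1
print(rare_pairs(SAMPLE, 2))  # {('foaf:topic_interest', 'ex:privacy')}
```

Dropping the rare pair from the sample makes all three nodes indistinguishable. In SPARQL, the counting part maps naturally onto GROUP BY with COUNT, which is a reasonable place to start the exercise. Links between nodes are ignored here, which is a real limitation for graph data.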

Conclusion

There is much more to look at - differential privacy, l-diversity, etc.
(privacy could be a whole summer school in itself).

Conclusion

And we are only starting to see the new challenges that the web of data brings to privacy (data integration, lack of control, etc.)...
Imagine Bruce Wayne on Facebook posting pictures of his holidays with Poison Ivy...

Conclusion

... as well as the new opportunities (sense making, user-centric data integration, etc.)

see d'Aquin et al. demo @ ISWC 2013

Conclusion


see d'Aquin et al. @ Privon 2013

Conclusion

As the web of data grows, these opportunities and challenges get amplified.
Many research opportunities for web data, linked data and semantic web people!

Thank you

This presentation at http://mdaquin.net/pres/wiss2014

mdaquin.net

@mdaquin


semprivacy.com