Pseudonymization Techniques for Privacy Study with Clinical Data

Privacy includes the right of individuals and organizations to determine for themselves when, how and to what extent information about them is communicated to others. The growing need to manage large amounts of data in hospitals and clinics raises important legal and ethical challenges. This paper introduces the privacy-protection problem, describes a test implementation, and highlights the relevance of trusted third parties and of privacy-enhancing techniques (PETs) in the context of data collection, e.g., for research. Practical approaches to the pseudonymization model for batch data collection are presented. The application of the described techniques today demonstrates the potential health benefits that innovative privacy-enhancing techniques can provide. Technical PET solutions can unlock valuable data sources that would otherwise not be available.


Introduction
Organizations like hospitals, clinics or pharmacies collect, store and process vast amounts of personal data. They are interested in releasing their data for research or other public benefit purposes. However, much of this data is of a sensitive nature (e.g. medical data) and although generally used for the benefit of the community, it can be easily abused by malicious people.
Privacy incidents are frequently reported in the public media, yet there is often a lack of concern about sensitive data. However, people tend to become more apprehensive when their personal healthcare-related data are at stake, mainly because they can easily imagine motives for abuse and understand the impact of such abuse. Another obvious case in point is that at some point in their life practically everyone is confronted with loan and insurance applications. Recent incidents, such as the one in which an outsourced transcriber threatened to disclose all medical records she had been processing for a US hospital [1], clearly illustrate that the threat to privacy is genuine. Public authorities are also sharply aware of these repercussions, and they are putting considerable effort into privacy protection legislation [2,3]. It cannot be denied that privacy protection directly impacts personal well-being as well as society as a whole. Indeed, some go as far as to believe that failure to protect privacy might lead to our ruin [4]. Privacy is in fact recognized as a fundamental human right.
In Malaysia, there is so far no dedicated body that pays careful attention to the requirement of obtaining informed consent from subjects. As a result, most hospitals and clinics are overly cautious about granting access to their information, because they know that the implications of the information involved are very complex; there is thus a real danger that informed consent is in fact ill-informed consent. Research ethics and security guidelines demand that research units divert more and more resources and time to privacy and identity protection, but burdensome requirements governing the transmission of medical information could unnecessarily discourage research. Well-intentioned privacy laws should not clash with the legitimate use of information when it is clearly to the public's benefit.
Protecting human rights such as privacy while maximizing research productivity is a major challenge. A first step towards this goal is the research and implementation of technical solutions to the privacy problem. Privacy-enhancing techniques or technologies (PETs) should be provided to unlock invaluable data sources for the benefit of society without endangering individual privacy. This paper focuses on the possible use of privacy-enhancing techniques in the context of research and statistics for health care.

Privacy Enhancing Techniques
There are many situations in which privacy can be an issue. Research to date covers many different areas, including anonymous:
• Communication (anonymous remailers, anonymous surfing, etc.),
• Transactions,
• Publication and storage,
• Credentials,
• Files and databases.
This paper focuses on medical applications, in which privacy issues are raised by the information content of the stored data. Privacy-enhancing techniques for privacy protection within databases help to protect the privacy of the subject of a database record, i.e. a person or organization listed in the database. Simply put, these privacy-enhancing techniques allow storing relevant and useful information in a way that no one can ever find out who the information is actually about. Examples of these techniques are (non-exhaustive list):
• "Hard" de-identification by the owner of the data;
• Various types of anonymization and/or pseudonymization;
• Privacy risk assessment techniques;
• Controlled database alteration (modification, swapping or dilution of data);
• Data flow segmentation.
Today, privacy-enhancing technology has already proven its usefulness for privacy protection in marketing and research data collection in the United States [5], and in Malaysia and other Asian countries such as Singapore and Japan the use of PETs is growing in parallel with urbanization. In this paper, however, our focus lies on the implementation of pseudonymization techniques and complementary PETs in a clinical environment in Malaysia; our experiment was carried out at a public hospital in a southern city.

Pseudonymization Techniques
Pseudonymization refers to privacy-enhancing techniques and methods used to replace the true (nominative) identities of individuals or organizations in databases by pseudo-identities (pseudo-IDs) that cannot be linked directly to their corresponding nominative identities [6].
With this technique, the data, which contains identifiers and "payload data" (non-identifying data), is separated. The pseudonymization process translates the given identifiers into a pseudo-ID by using secure, dynamic and preferably irreversible cryptographic techniques (the identifier transformation process should not be performed with translation tables). For an observer, the resulting pseudo-IDs thus appear as completely random strings of characters. This transformation can be implemented differently according to the project requirements. Pseudonymization can:
• always map a given identifier to the same pseudo-ID;
• map a given identifier to a different pseudo-ID;
• be time-dependent (e.g. always varying or changing over specified time intervals);
• be location-dependent (e.g. changing when the data comes from different places);
• be content-dependent (e.g. changing according to the content).
Pseudonymization is used in data collection scenarios where large amounts of data from different sources are gathered for statistical processing and data mining (e.g. research studies). In contrast with horizontal types of data exchange (e.g. for direct care), vertical communication scenarios (e.g. in the context of disease management studies and other research) do not require identities as such: here pseudonymization can help find solutions. It is a powerful and flexible tool for privacy protection in databases, which is able to reconcile the following two conflicting requirements: the adequate protection of individuals and organizations with respect to their identity and privacy, and the possibility of linking data associated with the same data subject (through the pseudo-IDs) irrespective of the collection time and place.
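The mapping options described above can be illustrated with a minimal sketch. It assumes a keyed one-way function (HMAC-SHA256) as the cryptographic transformation; the paper does not name a specific algorithm, so this choice, the function name `pseudonymize`, the key and the sample identifiers are all hypothetical.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes, context: str = "") -> str:
    """Map an identifier to a pseudo-ID with a keyed one-way function.

    HMAC-SHA256 is an assumption made for this sketch, not the paper's method.
    An empty context gives a stable mapping; a context such as a year or a
    site code gives a time- or location-dependent mapping.
    """
    # Derive a per-context key first, so context and identifier cannot be confused.
    context_key = hmac.new(secret_key, context.encode("utf-8"), hashlib.sha256).digest()
    return hmac.new(context_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


key = b"example-secret-key"  # hypothetical key; a real key would be generated and stored securely

# Same identifier, same context: always the same pseudo-ID (records stay linkable).
stable_a = pseudonymize("patient-0001", key)
stable_b = pseudonymize("patient-0001", key)

# Same identifier, different time context: different pseudo-IDs.
per_2023 = pseudonymize("patient-0001", key, context="2023")
per_2024 = pseudonymize("patient-0001", key, context="2024")
```

No translation table is stored: the pseudo-ID is recomputed from the identifier and the key each time, which is what keeps the stable variant linkable across collection times and places.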
Because of this flexibility, however, correct use of pseudonymization technology is not as straightforward as often suggested. Careless use of pseudonymization technology could lead to a false feeling of privacy protection. The danger mainly lies within the separation of identifiers and payload.
Before starting this process, it is important to make sure that the payload data does not contain any fields that could lead to indirect re-identification, i.e. re-identification based on content, not on identifiers. The key to good privacy protection through pseudonymization is thus careful privacy assessment. Privacy gauging, or privacy risk assessment, measures the risk that a subject in a "privacy protected" database can be re-identified without the cooperation of that subject or against his or her will. This consists in measuring the likelihood that a data subject could be re-identified using the information that is available (hidden) in the database. The lower this re-identification risk, the better the privacy of the subjects listed in that database is protected. Conducting a privacy analysis is a difficult task. At this point in time, no single measure for database privacy is fully satisfying, and this matter is still a hot topic in scientific communities. However, extensive research, mainly conducted by statisticians (area of statistical databases, etc.) and computer scientists such as data miners and security experts, is making significant progress.
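As an illustration of what such a measurement can look like, the sketch below computes one simple and admittedly incomplete indicator: the size of the smallest group of records sharing the same values on a set of potentially identifying payload fields, in the spirit of k-anonymity. This measure is an assumption chosen for the example, not the assessment method used in this paper, and the field names and records are made up.

```python
from collections import Counter


def reidentification_risk(records, quasi_identifiers):
    """Return (k, worst_case_risk) for the given payload records.

    k is the size of the smallest group of records that share the same
    values on all quasi-identifier fields; worst_case_risk is 1 / k.
    """
    if not records:
        raise ValueError("at least one record is required")
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    k = min(groups.values())
    return k, 1.0 / k


# Hypothetical payload records after identifiers have been removed.
payload = [
    {"age_band": "30-39", "postcode": "81", "diagnosis": "asthma"},
    {"age_band": "30-39", "postcode": "81", "diagnosis": "diabetes"},
    {"age_band": "40-49", "postcode": "81", "diagnosis": "asthma"},
]

# One record is unique on (age_band, postcode), so the worst-case risk is 1.0.
k_fine, risk_fine = reidentification_risk(payload, ["age_band", "postcode"])

# Using only the coarser postcode field, all three records fall in one group.
k_coarse, risk_coarse = reidentification_risk(payload, ["postcode"])
```

A unique combination of payload values is exactly the kind of indirect re-identification the text warns about; coarsening or removing fields raises k and lowers the worst-case risk.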
From our literature review, pseudonymization performance can be assured by using privacy risk assessment techniques. Data collection models are used to estimate the risk level for re-identification by attackers (a priori risk assessment). How the data should be separated (identifiers versus payload), filtered (removal of information) and transformed (transforming payload information in order to make it less identifying) is subsequently determined on the basis of these results. In other words, one of the uses of privacy risk assessment techniques is to determine the correct configuration of PETs.
Many more aspects of the pseudonymization process are closely linked and key to ensuring optimum privacy protection, for example the location of the identifier and payload processing and the number of steps in which the pseudonymization is performed.

Pseudonymization Implementations
The pseudonymization described above provides privacy protection for data collection for research and market studies. Two logical entities are involved in handling the data:
1. The data suppliers or 'sources';
2. The data collectors, one or several 'data registers' where the pseudonymized data are stored.
Data suppliers typically have access to nominative data (e.g. treating doctors), whereas the data collectors should only have access to anonymous data. In this research, a possible scenario is the use of pseudonymization in batch data collection. The three interacting entities (sources, pseudonymization server and register) are shown in Figure 1. In contrast to traditional data collection, the sources (e.g. electronic medical record systems) do not necessarily interact directly with the database and vice versa. Communication is routed through a pseudonymization server (TTP server), where the pseudonymization and the processing of relevant data take place, as required.

Batch Data Collection
Data is gathered and packed at the sources, typically in local databases. An example could be a local patient database managed at a clinic. The data is transmitted on a regular basis to the register through the TTP server, where it is pseudonymized.
The data extracted from the local databases is split into two parts, identifiers and (screened) payload data, according to rules determined during the privacy risk assessment stage. Identifiers are pre-pseudonymized at the source, i.e. a first transformation into pre-pseudo-IDs is performed. The payload data (assessment data) is filtered for indirectly identifying data and transformed to avoid re-identification of the anonymous data. Finally, the pre-pseudo-IDs are encrypted using a public-key scheme for decryption by the TTP server exclusively. The payload data are public-key encrypted to the register, so that only the register can read the data. Both are then transmitted to the TTP over secure links (authenticated and encrypted).
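The source-side steps can be sketched as follows. This is a simplified illustration under several assumptions: the field names, the rule sets and the use of HMAC-SHA256 for pre-pseudonymization are hypothetical, and the two public-key encryption steps (pre-pseudo-ID to the TTP, payload to the register) are deliberately left out because they would require a cryptographic library beyond the standard library.

```python
import hashlib
import hmac

# Hypothetical rules, standing in for the outcome of the privacy risk assessment.
IDENTIFIER_FIELDS = {"name", "national_id"}  # direct identifiers
DROPPED_FIELDS = {"street_address"}          # indirectly identifying, removed


def prepare_record(record: dict, source_key: bytes):
    """Split one local record into (pre_pseudo_id, screened_payload)."""
    # 1. Pre-pseudonymize the identifiers (first transformation, at the source).
    identifier = "|".join(str(record[f]) for f in sorted(IDENTIFIER_FIELDS))
    pre_pseudo_id = hmac.new(
        source_key, identifier.encode("utf-8"), hashlib.sha256
    ).hexdigest()

    # 2. Filter and transform the payload to reduce indirect re-identification.
    payload = {}
    for field, value in record.items():
        if field in IDENTIFIER_FIELDS or field in DROPPED_FIELDS:
            continue
        if field == "birth_date":
            # Generalize an ISO date (YYYY-MM-DD) to the year only.
            payload["birth_year"] = value[:4]
        else:
            payload[field] = value
    return pre_pseudo_id, payload


record = {
    "name": "A. Example",
    "national_id": "000000-00-0000",
    "street_address": "1 Example Road",
    "birth_date": "1980-05-17",
    "diagnosis": "asthma",
}
pre_pseudo_id, payload = prepare_record(record, b"hypothetical-source-key")
```

In the scheme described in the text, `pre_pseudo_id` would next be encrypted for the TTP server and `payload` for the register before both are sent over the secure link.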
Full trustworthiness and integrity of the service is thus guaranteed not only by means of policy but also on a technical level. First, the TTP never actually processes real identities (there is a pre-pseudonymization stage). Second, although payload information passes through the TTP server, the latter can neither interpret nor modify the assessment data, because this data is encrypted for decryption by the final destination (data register) only.
As researchers, we understand that although the pre-pseudonymized information leaving the source no longer contains any real identities, this does not always guarantee absolute privacy. Because the pre-pseudonymization software is available at many sources, a smart intruder might find a way to map identities to their corresponding pseudo-identities in a 'dictionary attack', by entering known identities and creating a translation table. Such an attack can be prevented by the use of tamper-proof pseudonymization devices; these are, however, not yet deployed in real data collection scenarios.
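The dictionary attack described above can be made concrete with a short sketch. It assumes, purely for illustration, that the pre-pseudonymization step is a fixed function that the intruder can run on chosen inputs (here an unkeyed SHA-256 hash); the function names and the identities are hypothetical.

```python
import hashlib


def pre_pseudonymize(identifier: str) -> str:
    """Stand-in for pre-pseudonymization software available at every source."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


def build_translation_table(known_identities):
    """Attacker's step: feed known identities in and record the outputs."""
    return {pre_pseudonymize(identity): identity for identity in known_identities}


# A pre-pseudo-ID observed by the intruder, e.g. in intercepted data.
observed = pre_pseudonymize("patient-0002")

# The intruder enumerates identities they already know or can guess.
table = build_translation_table(["patient-0001", "patient-0002", "patient-0003"])

# A simple lookup reverses the "irreversible" transformation.
recovered = table.get(observed)
```

The transformation itself is never inverted; the attack works only because the intruder can evaluate it on candidate identities, which is why control over who can run the transformation matters.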
Based on previous research, we believe that by performing a second transformation in a centrally controlled location, for example in the TTP server, optimum security can be offered against such malicious attacks. As already mentioned, there are further advantages to the use of an intermediary party. As the TTP server dynamically controls the pseudonymization process, additional privacy-protecting functionality can be added, such as monitoring of incoming identities against such attacks, re-mapping of identifiers, data flow segmentation, data source anonymization, etc.
After this second stage at the TTP, in which we propose that the pre-pseudonymized identifiers are transformed into the final pseudo-IDs using cryptographic algorithms, both the payload data and the pseudo-IDs are transferred to the register via secure communication.
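A minimal sketch of the second transformation at the TTP is given below, combined with a very simple form of monitoring of incoming identities. It assumes that the pre-pseudo-ID has already been decrypted by the TTP, that the second transformation is HMAC-SHA256 under a key known only to the TTP, and that a per-source limit on distinct identities is an acceptable stand-in for monitoring; the class name, the key and the threshold are hypothetical.

```python
import hashlib
import hmac
from collections import defaultdict


class PseudonymizationServer:
    """Hypothetical TTP server: second transformation plus basic monitoring."""

    def __init__(self, ttp_key: bytes, max_ids_per_source: int):
        self._key = ttp_key              # secret held only by the TTP
        self._max = max_ids_per_source   # assumed monitoring threshold
        self._seen = defaultdict(set)    # source -> distinct pre-pseudo-IDs

    def finalize(self, source_id: str, pre_pseudo_id: str) -> str:
        """Turn a pre-pseudo-ID into the final pseudo-ID for the register."""
        self._seen[source_id].add(pre_pseudo_id)
        if len(self._seen[source_id]) > self._max:
            # A source submitting unusually many identities may be probing.
            raise PermissionError(f"source {source_id} exceeded its identity limit")
        return hmac.new(
            self._key, pre_pseudo_id.encode("utf-8"), hashlib.sha256
        ).hexdigest()


ttp = PseudonymizationServer(b"hypothetical-ttp-key", max_ids_per_source=1000)
final_id = ttp.finalize("clinic-a", "pre-pseudo-id-from-source")
```

Because the final pseudo-ID depends on a key that never leaves the TTP, a translation table built with the source-side software alone does not map known identities to the pseudo-IDs stored at the register.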
At the register, the data can then be stored and processed without raising any privacy concerns.

Conclusions
Privacy includes the right of individuals and organizations to determine for themselves when, how and to what extent information about themselves can be communicated to others. Several types of privacy-enhancing technologies exist that can be used for the correct treatment of sensitive data in health, but in this paper we focus on how advanced pseudonymization techniques can provide optimal privacy protection of individuals. The research also shows that privacy-enhancing techniques are currently deployed for medical research, which demonstrates that the use of pseudonymization and other innovative privacy-enhancing techniques can unlock valuable data sources that would otherwise not be legally available.