J Forensic Sci, November 2015, Vol. 60, No. 6
doi: 10.1111/1556-4029.12824
Available online at: onlinelibrary.wiley.com

TECHNICAL NOTE
DIGITAL & MULTIMEDIA SCIENCES

Christina A. Malone,1 M.F.S.; Michael J. Salyards,1 Ph.D.; and Meredith Hein,2 B.S.

Inter-/Intra-observer Reliability of Hand Assessment Using Skin Detail: A Count-based Method*

ABSTRACT: Skin detail of the hand is used in photographic comparisons, yet its reliability has not been evaluated. This study examines a count-based method for documenting skin features. In Part I, 14 individuals counted skin features on 40 color images of the hand, three of which were repeated. An average correlation value of 0.557 was obtained for interobserver assessment; values ranged from 0.545 to 0.832 for intra-observer assessment. The variation in correlation values for hands suggests that there are certain distinguishing characteristics that increase reliability. In Part II, 17 examiners assessed 20 nonrepeated grayscale images of hands by circling skin features. An average correlation value of 0.674 was obtained, but visual assessment of examiner markings suggested some examiners grouped features whereas others viewed them individually. The results suggest further research is warranted, some hands may be more suitable for comparisons, and a standardized method for examining skin features is needed.

KEYWORDS: forensic science, image analysis, photographic comparison, human identification, skin manifestations, skin pigmentation

Recent developments in court proceedings, such as Daubert and Kumho Tire, and assessments of the fields within forensic science, including The National Academy of Sciences Report, recommend more research to establish a firm foundation for comparative sciences. Previously accepted theories require the backing of rigorous scientific validation (1,2).
Through the development of research, forensic fields, including image analysis, will more efficiently aid judges and juries in making their decisions during the judicial process. The need for examination and interpretation within forensic identification sciences demands an increased effort for research and substantiation. The current research addresses the recommendation for study and validation when employing skin detail in forensic image comparisons.

Notably, the sustained research in the field of image comparisons is imperative to the empirical support that is required of the forensic sciences. In particular, the concept of “individualization fallacy” suggests that it is not possible to individualize a subject or object by consistent qualities. The question of whether a subject or object can be individualized is not a factor of uniqueness, but rather a statement of probability (3). By establishing and conducting research into image comparisons, examiners are given a tool to support their conclusions, while reporting with “appropriate clarity and restraint” (3).

1 Defense Forensic Science Center, Documents & Digital Evidence, 4930 N 31st St., Forest Park, GA 30297.
2 24th Air Force, San Antonio, TX.
*Presented at the 64th Annual Meeting of the American Academy of Forensic Sciences, February 20–25, 2012, in Atlanta, GA.
The opinions or assertions contained herein are the private views of the author and are not to be construed as official or as reflecting the views of the Department of the Army or the Department of Defense.
Names of commercial manufacturers or products included are incidental only, and inclusion does not imply endorsement by the authors, DFSC, USACIDC, OPMG, DA, or DoD.
Received 5 June 2014; and in revised form 30 Sept. 2014; accepted 6 Oct. 2014.

Image comparisons are vital processes in forensic image analysis for the identification of a subject or object in an image.
When comparing known and questioned images for a photographic comparison, any number of details within the image may be used to draw conclusions regarding the identification or elimination of the subject or object depicted. Individualization is achieved when there is an “agreement of corresponding individual characteristics of such number and significance to preclude the possibility (or probability) of their having occurred by mere coincidence, and establishing that there are no differences that cannot be accounted for” (4). For identification to be possible, actual features must be documented by examiners and must correspond between the known individual and the questioned individual “to the exclusion of all others” (5).

In forensic image casework, images of hands have been increasingly used when conducting photographic comparisons. In particular, when examining these images, features, such as scars, tattoos, freckles, moles, sunspots, creases, and any other common features that display a random distribution, can be used to compare questioned and known subjects to aid in identification (6). Collectively, freckles, moles, sunspots, and other dermatological skin features may be referred to as pigmented lesions. While many of these pigmented lesions are not distinctive by themselves, the random distribution may create a pattern that is unique to an individual. Additionally, the consistency of these features throughout an individual’s life demonstrates that pigmented lesions, scars, and tattoos may be useful for the identification of an individual.

Published 2015. This article is a U.S. Government work and is in the public domain in the U.S.A.

In addition to forensic use, the automated detection of pigmented lesions in digital images has been successfully used in regard to biometrics (7).
Forensic comparisons and biometric comparisons share similar premises; both base their assessment on physical features that persist throughout one’s lifetime (8). Recent biometric studies have expanded techniques to include the automated detection of skin detail (9–14). The results of such research suggest that micro-level detail in the skin offers discriminating information about an individual (11). However, in order for these soft biometric markers to be used for identification purposes, there must be standardization in training and methods to ensure that examiners will see the same markers in each image (15). Through additional research, standardization, and training, skin features will gain further acceptance and reputability in forensic photographic comparisons.

The hand, in particular, is a useful tool for identification in crimes, such as child pornography and others, where a face may not be visible (5). By identifying minute details, the hand can be individualized and used in identification (11). There are limitations with such analyses, however, as the image must be of a high enough resolution to ensure that the features can be seen accurately (15). An additional limitation is that, while forensic comparisons of the hand have been conducted, there is little information available as to the consistency and reliability with which examiners are conducting such examinations. This study aims to mitigate this limitation.

The purpose of this paper is to give a better understanding of the use of skin detail in hand identification and examine the inter- and intra-observer rates when documenting skin features found on the hand. To fully analyze the factors influencing the documentation of skin features, the current research is divided into two parts. In Part I, inter- and intra-observer abilities were assessed with regard to education and background of the observer. Part II of the study limits the conclusions to only individuals with experience in image analysis.
Additionally, Part II includes a more in-depth look at how examiners are viewing and documenting skin features.

FIG. 1––Diagram of 14 regions of the dorsal side of the hand.

Materials and Methods

Part I

A database of images was obtained by photographing the dorsal side of hands of male and female employees at the United States Army Criminal Investigation Laboratory (USACIL). The hands were photographed in the same orientation with a Nikon (Tokyo, Japan) D3X in RAW format using a 60-mm lens, an aperture of f/25, a shutter speed of 1/60 s, and an ISO of 100. All images were subsequently saved as TIFF images at a resolution of 240 pixels per inch. The images of 34 different right hands were scaled one-to-one and printed on photographic paper. Three of these images were repeated three times to give a total of 40 images. The images were put into four groups of ten images each. All four groups of images were examined by each observer, one group at a time. Repeated images were placed in separate groups.

Fourteen observers participated in the study. The observers were a mixture of trained forensic examiners and college-level interns. Each observer completed a survey that identified his or her education and experience in examination, pattern recognition, and casework. From the answers to these questions, observers were classified as either trained examiners or untrained interns.

Each observer was asked to document the skin features on each hand by dividing the hand into fourteen regions (Fig. 1) and counting the number of features in each region, without using any other aids (magnifying glasses, etc.). Features included freckles, moles, sunspots, scars, or other identifying markers. Hair, knuckle creases, vein patterns, and nail features were not counted. The number of features in each region was totaled, as was the feature count for the entire hand. Results for each observer were collected and tabulated for analysis.
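The count-based record described above can be sketched in a few lines of Python. This is an illustrative sketch only: the paper does not state how the R-values were computed, so treating each observer's 14 regional counts as a vector and taking the Pearson correlation between two observers is an assumption. The counts and the names `pearson_r`, `observer_a`, and `observer_b` are hypothetical.

```python
from math import sqrt

N_REGIONS = 14  # regions of the dorsal side of the hand (Fig. 1)


def pearson_r(x, y):
    """Pearson correlation between two equal-length count vectors."""
    if len(x) != len(y):
        raise ValueError("count vectors must have the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant counts")
    return cov / sqrt(var_x * var_y)


# Hypothetical feature counts for one hand, one entry per region.
observer_a = [2, 0, 1, 3, 0, 0, 1, 4, 2, 0, 1, 0, 5, 2]
observer_b = [1, 0, 1, 2, 0, 1, 1, 3, 2, 0, 0, 0, 6, 2]

assert len(observer_a) == len(observer_b) == N_REGIONS

total_a = sum(observer_a)  # feature count for the entire hand
total_b = sum(observer_b)
r_same_hand = pearson_r(observer_a, observer_b)  # assumed same-hand comparison

print(total_a, total_b, round(r_same_hand, 3))
```

Under this assumed scheme, two observers who record identical regional counts would yield a correlation of one, and a hand for which an observer records the same count in every region would have no defined correlation.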
Part II

Additional research was conducted to investigate how skin features were viewed when limiting the observers to only those with image comparison experience. Images of 20 different hands were taken in the same orientation, scaled one-to-one, enhanced, gray-scaled, and printed on photographic paper. All other image properties were the same as those of the images used in Part I. The enhancement of images was performed in Adobe Photoshop and was limited to a global contrast adjustment using Levels or Curves.

Each set of 20 nonrepeated images was examined by each of 17 observers. The experienced observers were attendees at the 2012 Scientific Working Group on Imaging Technologies (SWGIT) Bi-Annual meeting. To be included, participants were required to have experience in imaging, image analysis, or comparative science. Each observer also completed a survey detailing his or her education and experience.

Observers were given a hand diagram showing the 14 regions of the hand (Fig. 1). They were asked to segment the hands in each of the 20 images according to this diagram. Additionally, they were asked to circle any prominent features with a marker on each printed image. The additional marking of the images in this portion of the study was included to enable further assessment of where differences arose when documenting skin features. Results were recorded and tabulated by counting the features marked after examiners had completed the assignment.

TABLE 1––Part I: R-values for interobserver reliability.

                  Same hands   Different hands
Pairwise Comp.    3640         152,880
Avg R             0.557        0.312
Std Dev           0.320        0.3565
%RSD              61.3         153.6

FIG. 2––Part I: Interobserver Gaussian distribution and t-value.

Results

Part I

All data were compared according to examiner, region, different hands, and identical hands. Correlation values (R-values)
were calculated for same hands and different hands to determine the similarity between the results of different examiners for each hand to assess the interobserver reliability. Same hand values were calculated for different observers documenting features on the same hands. Theoretically, if all observers documented the same features, there should be a correlation of one. Different hand values were calculated for observers documenting features on different hands. As this is a comparison of different hands, very low correlation values would be expected. The correlation value obtained when different observers examined the same hands was 0.557, while the correlation value for different hands was 0.312 (Table 1). A Student’s t-test confirmed that the correlation value for different hands was significantly lower than for the same hands, but overall lower correlations than expected were obtained for the same hands (Fig. 2).

Correlation values were also calculated for the same examiner for repeated hands to assess the intra-observer reliability. The correlation values calculated reflect the same observer documenting the same hand; therefore, it would be expected that correlation values would approach one. Each of the three repeated hands was assessed individually (Table 2). Hand 8 demonstrated the highest correlation value of 0.8323. Hand 39 demonstrated a correlation value of 0.6789. Hand 25 demonstrated a correlation value of 0.5447. Overall, the average correlation value for the three hands was 0.6853. When compared with the interobserver correlation values for different observers assessing the same hands, intra-observer correlation values for the same observers assessing the same hands were not significantly different (Fig. 3).

The role of experience was also assessed. Observers were classified as either trained examiners or untrained interns, based on the answers given regarding education and experience.
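The intra-observer average reported above, and the pairwise-comparison counts listed in Tables 1 and 2, can be checked with simple arithmetic. The sketch below is hedged: the paper does not describe how the comparisons were enumerated, so the counting schemes in the comments are assumptions chosen only because they reproduce the tabulated Part I values. All variable names are hypothetical.

```python
from math import comb

# Part I design, as stated in the text.
observers = 14
images = 40            # 34 hands, three of them shown three times
repeats_per_hand = 3

# Assumed scheme for same hands: every pair of different observers,
# compared on each of the 40 images.
same_hand_pairs = comb(observers, 2) * images

# Assumed scheme for different hands: every pair of images, with every
# pairing of observers (including an observer paired with themselves).
different_hand_pairs = comb(images, 2) * observers ** 2

# Assumed scheme for intra-observer comparisons: each observer's three
# viewings of a repeated hand, taken in pairs.
intra_pairs_per_hand = observers * comb(repeats_per_hand, 2)

# Mean of the three intra-observer R-values (Hands 8, 39, and 25).
intra_r = [0.8323, 0.6789, 0.5447]
mean_intra_r = sum(intra_r) / len(intra_r)

print(same_hand_pairs, different_hand_pairs, intra_pairs_per_hand)
print(round(mean_intra_r, 4))
```

These products match the 3640, 152,880, and 42 pairwise comparisons in Tables 1 and 2, and the mean matches the reported 0.6853; other enumeration schemes could give the same totals.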
Correlation values were calculated for each group of observers (Fig. 4). There was no significant difference found between the correlation values for interns and examiners when comparing either the same hands or different hands.

TABLE 2––Part I: R-values for intra-observer reliability.

                  Hand 8    Hand 39   Hand 25
Avg R             0.8323    0.6789    0.5447
Pairwise Comps    42        42        42
95% CI            0.0086    0.0097    0.0156

FIG. 3––Part I: Intra-observer Gaussian distribution and t-value.
FIG. 4––Part I: R-values based on experience.

Part II

For Part II of the study, correlation values were again calculated to compare the results for the same hands and different hands to assess interobserver reliability. For different examiners documenting features on the same hand, a correlation value of 0.674 was obtained. When comparing different hands, a correlation value of 0.330 was obtained (Table 3). Again, a Student’s t-test confirmed that the correlation value for different hands was significantly lower than for the same hands (Fig. 5). As there were no repeated images in this portion of the study, no correlation values were calculated to assess intra-observer reliability.

The years of experience of each examiner were compared by groupings of 1–5 years, 6–10 years, and 11+ years. There was no significant difference in the correlation value obtained with regard to experience (Fig. 6).

TABLE 3––Part II: R-values for interobserver reliability.

                  Same hands   Different hands
Pairwise Comp.    3800         76,000
Avg R             0.674        0.330
Std Dev           0.2863       0.3626
%RSD              41.9         121.4

FIG. 5––Part II: Gaussian distribution and t-value.
FIG. 6––Part II: R-values based on experience.

Discussion and Conclusion

It was expected that different observers documenting features on the same hands would have similar findings. While a significantly higher correlation value was obtained for the same hands when compared with the correlation value for different hands, the correlation value was still not as high as desired. The lack of high correlation values for different observers documenting features on the same hand demonstrates that observers may be documenting some of the same features, but overall, there is a lack of standardization in how or which features are included.

The intra-observer results demonstrate several key points. First, there was a strong correlation for Hand 8. This finding suggests that observers were able to generate repeatable results when given an image. Second, there were lower correlation values for Hands 25 and 39. This finding suggests that there are certain qualities about the hand images that made their features more or less easy to document (Fig. 7). For instance, Hand 8 had several large, easily identifiable skin features. Conversely, Hand 25 had few skin features, and Hand 39 had few skin features and a reduced contrast between the skin tone and color of potential skin features.

FIG. 7––Hands 8, 39, and 25.

To further assess the possibility that some hands are easier to identify than others, individual correlation values were examined for each of the hands. In Part I, any hands with a correlation value of 0.70 or higher were grouped together (Fig. 8), and those with a correlation value of 0.40 or lower were grouped together (Fig. 9). This grouping was repeated in Part II (Figs 10 and 11). A visual assessment was conducted to compare these two groups of hands. The hands with clearly defined features and increased contrast between the skin tone and skin features tended to have higher correlations, whereas the hands lacking in features or contrast tended to have lower correlations. While further study should be conducted to confirm this trend, these findings demonstrate that hands with easily identifiable features will generate more reliable intra- and interobserver results.
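The threshold grouping described above can be illustrated with a short Python sketch. The thresholds come from the text (0.70 or higher, 0.40 or lower); the per-hand R-values, the hand numbers, and the function name `group_hands` are hypothetical and are not data from the study.

```python
HIGH_THRESHOLD = 0.70  # "0.70 or higher" in the text
LOW_THRESHOLD = 0.40   # "0.40 or lower" in the text


def group_hands(r_by_hand):
    """Split hands into high- and low-correlation groups by threshold."""
    high = sorted(h for h, r in r_by_hand.items() if r >= HIGH_THRESHOLD)
    low = sorted(h for h, r in r_by_hand.items() if r <= LOW_THRESHOLD)
    return high, low


# Hypothetical per-hand average R-values, keyed by hand number.
example_r = {1: 0.82, 2: 0.35, 3: 0.55, 4: 0.70, 5: 0.40, 6: 0.12}

high_group, low_group = group_hands(example_r)
print(high_group, low_group)
```

Hands falling between the two thresholds (hand 3 in this example) belong to neither group, which mirrors the text's comparison of only the high and low groups.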
This finding demonstrates that care should be taken when conducting photographic comparisons between images of hands that have very few skin features. If few skin features are documented, other qualities of the subject in the image need to be taken into account.

FIG. 8––Part I: Hands with high correlations (>0.70).
FIG. 9––Part I: Hands with low correlations (<0.40).

It had been expected that trained examiners would have more consistent results than the untrained interns. Based on the correlation values obtained, experience did not play a significant role in which features were documented on each of the hands. This finding supports the conclusion that there is a lack of standardization when assessing skin features. Documenting skin features may not require extensive experience, but experience is likely to play a role when an actual conclusion is given in regard to whether the hand in an image can be identified or eliminated.

Additionally, Part II of the study had observers mark on the images to document the features observed instead of merely providing a count for each region. While not statistically assessed, there appeared to be two different approaches to how images were marked. The first technique was demonstrated by examiners marking each individual feature, whereas the second style was exhibited by examiners marking features in groups. For instance, one examiner may circle a group of five pigmented lesions and identify it as one unique feature, whereas another examiner would mark each of the five pigmented lesions separately. Clearly, the difference between two such approaches would make a count-based method problematic. For such an approach to be repeatable and reliable, it would be necessary to standardize the method and the training examiners receive. Additionally, this visual discovery of two different methods demonstrated that observers may be seeing the same features,
but the method of documentation was where differences were arising. This postulation, in particular, requires further study to demonstrate that multiple methods of documentation may yield accurate results.

FIG. 10––Part II: Hands with high correlations (>0.70).
FIG. 11––Part II: Hands with low correlations (<0.40).

There are several sources of error that could have been problematic in the current study. Error could have occurred if examiners counted the same mark multiple times, leading to a high count, or if “unimportant” marks were ignored, resulting in a decreased count. This source of error was mitigated in Part II, as observers were required to mark the images themselves, not give a total count of features. Additionally, in Part I, the fourteen hand regions were not drawn on each hand, and it was up to the examiner to define these regions on each image. Marks on the edges of the regions could be counted in one, both, or neither region. Again, this limitation was mitigated in Part II of the study. Finally, unclear instructions may have led to confusion on what to count. In future study, clarification should be given to specify which skin features are “important,” that is, permanent and identifiable. To determine the importance of the features, examiners will establish which features, if not repeated in other images, would be a cause for concern.

Future research should address the areas of concern mentioned above, but should also focus on examining several other important aspects of image comparisons of the hand. First, research should involve experts who have been trained in hand comparisons. While the present research did not show a relationship between experience and reliability, it would be important to determine whether more specialized training in hand comparisons may alter this finding.
Additionally, such future studies may lead to recommended training programs and standardized procedures for image comparisons. Examiners conducting hand comparisons should also be surveyed to determine the education, training, and background that are currently accepted in the field.

Second, an expanded sample of hands would provide additional insight into the variability of features. Through the examination of more individuals, it would be possible to determine how other factors, such as sex, age, and genetics, influence the appearance and significance of pigmented lesions. Furthermore, an analysis of hands over time would also aid in assessing the permanence and reliability of particular features. The continued studies on image comparisons of the hand will also intertwine with the field of biometrics, specifically in regard to the recent studies concerning automated detection of skin detail (9–14). The application of such automated methods in forensic image comparison is an additional area for future research.

Finally, while the present study focused on the observer reliability when examining images of the hand, it is imperative to generate statistics on comparisons between known and questioned images of hands. Examiners should be asked to conduct comparisons of images of hands and to record any conclusions and characteristics influencing their decisions.

This study demonstrates that not all observers are looking at images in the same manner. While there is a lack of standardization in how features are documented, it does not invalidate the use of such features in a photographic comparison. Future study is warranted to examine how successful examiners are when tested with comparisons leading to a conclusion in regard to individualization. While the method may not be the same, it is important to determine whether the results can be validated without a standardized method.
At the same time, a count-based method may provide a valuable technique that will yield numerical data for additional analysis. In this case, further research and standardization are needed.

References

1. Raymond T. The future of forensic science. Aust J Forensic Sci 2006;38:3–21.
2. Saks M, Koehler J. The coming paradigm shift in forensic identification science. Science 2005;309:892–5.
3. Saks M, Koehler J. The individualization fallacy in forensic science evidence. Vanderbilt Law Rev 2008;61:199–219.
4. Tuthill H. Individualization: principles and procedures in criminalistics. Salem, OR: Lightning Powder Company Inc, 1994.
5. Spaun N, Vorder Bruegge R. Forensic identification of people from images and video. Proceedings of the 2nd IEEE Conference on Biometrics: Theory, Applications and Systems (BTAS 2008); 2008 Sept 29–Oct 1; Washington, DC. Piscataway, NJ: Institute of Electrical and Electronics Engineers, 2008;1–4.
6. Evison M, Vorder Bruegge R. The magna database: a database of three-dimensional facial images for research in human identification and recognition. Forensic Sci Commun 2008;10(2):1–11.
7. White R, Perednia D, Schowengerdt R. Automated feature detection in digital images of skin. Comput Methods Programs Biomed 1991;34:41–60.
8. Sarkar I, Alisherov F, Kim T, Bhattachrya D. Palm vein authentication system: a review. Int J Control Autom Syst 2010;3(1):27–33.
9. Park U. Face recognition: face in video, age invariance, and facial marks [PhD dissertation]. East Lansing, MI: Michigan State University, 2009.
10. Jain A, Lee J. Scars, marks, and tattoos: a soft biometric for identifying suspects and victims. SPIE Newsroom, 15 June 2009. DOI: 10.1117/2.1200906.1282.
11. Jain A, Park U. Facial marks: soft biometric for face recognition. Proceedings of the 16th IEEE International Conference on Image Processing (ICIP 2009); 2009 Nov 7–10; Cairo, Egypt. Piscataway, NJ: Institute of Electrical and Electronics Engineers, 2009;37–40.
12. Lee J, Jain A, Jin R. Scars, marks, and tattoos (SMT): soft biometric for suspect and victim identification. Proceedings of the 2008 Biometrics Symposium (BSYM); 2008 Sept 23–25; Tampa, FL. Piscataway, NJ: Institute of Electrical and Electronics Engineers, 2008;1–8.
13. Lin D, Tang X. From macrocosm to microcosm. Proceedings of the 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW ’08); 2008 June 23–28; Anchorage, AK. Piscataway, NJ: Institute of Electrical and Electronics Engineers, 2008;1355–62.
14. Pierrard J, Vetter T. Skin detail analysis for face recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR ’07); 2007 June 17–22; Minneapolis, MN. Piscataway, NJ: Institute of Electrical and Electronics Engineers, 2007;1–8.
15. Edmond G, Biber K, Kempt R, Porter G. Law’s looking glass: expert identification evidence derived from photographic and video images. Curr Issues Crim Justice 2009;20:337–77.

Additional information and reprint requests:
Christina A. Malone, M.F.S.
Documents and Digital Evidence Branch
U.S. Army Criminal Investigation Laboratory
Defense Forensic Science Center
Forest Park GA 30297
E-mail: christina.a.malone.civ@mail.mil