IBM SVS 4.0 Research and Development Status Update 6 for NYPD
Oct 16, 2012
IBM Confidential © 2012 IBM Corporation. All Rights Reserved.

IBM SVS 4.0 Analytics Evaluation for NYPD/LMSI
 NYPD/LMSI/MTA Camera and Profile Mapping
 Evaluation Process
 Ground Truth Application and Framework
 Accuracy Evaluation

Topic | Meeting Date
Plan: Abandoned Object Detection | 8/23
Plan: Near-Field People Search | 8/28
Plan: Forensic Object Search (Detection/Tracking/Color/Real-World Metrics/Speed/Size/Object Classes/Duration/Histogram) | 9/18
Review the presentation modification for the Forensic Object Search plan | 10/11
Start reporting results to NYPD (after internal QA) | Mid Nov (tentative)

NYPD Camera Taxonomy
(Taxonomy chart; legible figures: 2,633 LMSI cameras and 265 indoor turnstile cameras.)

Analytics Accuracy Evaluation Process
(Process diagram: the Ground Truth Application annotates video files/cameras to produce ground truth; SVS Analytics with the Meta Data Writer produces analytics results; the evaluation compares the two and generates a report.)
NYPD input:
1. Selection of video files and cameras
2. Ground truth
3. Performance metrics
4. Report formats

IBM SVS Ground Truth Application and Framework
• Provide a framework for defining a ground-truth schema for each evaluation profile/task
• Provide an application for annotating evaluation videos with ground truth
• Provide a programming library for reading/writing ground-truth data and analytics results for evaluation automation

SVS Abandoned Object Evaluation

Data Set – Staged and Investigated Drops
NYPD data: 557 drops, 63.24 hours (to be revised)
 NYPD 1: 9/1/09 -> 8/3/10; Staged Drops: 291; Video Count: 24; Duration: 38.23 hrs
 NYPD 2: 8/26/10 -> 10/19/10; Staged Drops: 142; Video Count: 12; Duration: 11.47 hrs
 NYPD 3: 1/10/12 -> 4/3/12; Staged Drops: 124; Video Count: 29; Duration: 13.54 hrs
 NYPD 4: ongoing; Staged Drops: TBD; Video Count: TBD; Duration: TBD
 NYPD 5: TBD; Investigated Drops: TBD (Slide 22); Video Count: TBD; Duration: TBD

Data Set – Challenging Scenes
Lighting changes, camera movement, "sun spots", challenging weather patterns, stationary/seated pedestrians, …

Annotation Details
(Two annotated example frames, offsets 02:31 and 04:41.)
1st box (offset 02:31): bag dropped; wait until the actor has achieved complete separation from the bag.
2nd box (offset 04:41): end of the abandonment period; wait until just before the actor picks up or moves the bag.
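A minimal sketch of how one annotated drop could be represented when read through the ground-truth programming library, assuming offsets in seconds and (x, y, w, h) boxes; the class and field names are illustrative, not the actual schema:

```python
from dataclasses import dataclass

@dataclass
class AbandonmentGroundTruth:
    """One staged drop, annotated as two boxes (illustrative schema)."""
    video_id: str
    drop_offset_sec: float   # 1st box: actor has fully separated from the bag
    end_offset_sec: float    # 2nd box: just before the bag is picked up/moved
    bbox: tuple              # (x, y, w, h) of the abandoned object

    @property
    def abandonment_duration_sec(self) -> float:
        return self.end_offset_sec - self.drop_offset_sec

# The offsets from the example frames above (02:31 -> 04:41):
gt = AbandonmentGroundTruth("NYPD1_cam03", 151.0, 281.0, (410, 220, 36, 28))
print(gt.abandonment_duration_sec)  # 130.0 seconds of abandonment
```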
Performance Test Summary
 4.0 Detection Rate/False-Positive Analysis: guide 4.0 tuning efforts by analyzing its ability to detect abandoned objects, while also considering false positives, in a variety of configurations. Data Set: Staged and Investigated Drops
 4.0 Adversarial Condition Analysis on Challenging Scenes: guide 4.0 tuning efforts by analyzing its ability to cope with particular categories of false positives in a variety of configurations. Data Set: Challenging Scenes
 Detection Rate Comparison: compare 3.6.7 and 4.0 in terms of detection rate in their deployment configurations. Data Set: Staged and Investigated Drops
 Adversarial Condition Comparison: compare 3.6.7 and 4.0 in terms of false-positive count in their deployment configurations. Data Set: Challenging Scenes
 False-Positive Rate Comparison: compare 3.6.7 and 4.0 in terms of false-positive rate in their deployment configurations. Data Set: Live cameras

4.0 Detection Rate/False-Positive Analysis
(Pipeline diagram: each data-set video Video1…VideoN is run through SSEs under release 4.0 configurations Config1…ConfigN; the resulting alerts Alerts1…AlertsN are evaluated against ground truth GT1…GTN and aggregated.)

4.0 Detection Rate/False-Positive Analysis
(Synthetic result charts: "Detection Rate" and "Signal to Noise Ratio". Note: synthetic results.)

4.0 Adversarial Condition Analysis
(Pipeline diagram: each challenging-scene video Video1…VideoN is run through SSEs under release 4.0 configurations Config1…ConfigN; the resulting outputs Results1…ResultsN are aggregated.)

4.0 Adversarial Condition Analysis
(Synthetic result chart. Note: synthetic results.)

Detection Rate Comparison (3.6.7 vs. 4.0)
(Pipeline diagram: each data-set video is run through SSEs under both the 4.0 and the 3.6.7 deployment configurations ConfigV1…ConfigVN; the alerts from each are evaluated against ground truth GT1…GTN and aggregated.)

Detection Rate Comparison (3.6.7 vs. 4.0)
(Synthetic result charts: "Detection Rate" and "Signal to Noise Ratio". Note: synthetic results.)

Adversarial Condition Comparison (3.6.7 vs. 4.0)
(Pipeline diagram: each challenging-scene video is run through SSEs under both the 4.0 and the 3.6.7 deployment configurations; the resulting alerts are aggregated.)

Adversarial Condition Comparison (3.6.7 vs. 4.0)
(Synthetic result chart. Note: synthetic results.)

False-Positive Rate Comparison (3.6.7 vs. 4.0)
(Diagram: a composite camera set is built from the Current Production Cameras (CPC, 29 cameras), the Tuning Set Cameras (TSC, 14 cameras), and the Rejected Cameras (RC, 4 cameras). Randomly sampled subsets of these cameras are used, or all cameras if there are enough computing resources. Alerts from the 3.6.7 and 4.0 SSEs (SSE1…SSEN) feed a false-positive categorization step whose category counts satisfy $C_G + C_Y + C_R = C_N$.)

Discussion and Revision Logs
Date | Description
8/23/2012 | First review with NYPD: Dr. Evan Levine and Sgt. Nelson Pimentel
8/24/2012 | Revised the document to reflect the meeting input from NYPD and submitted it for NYPD review and further feedback
8/28/2012 | Reviewed the changes with the NYPD team
8/29/2012 | Received feedback from NYPD and responded to their questions. No changes to the slides were requested.

Appendix

Investigated Drops
 Drops from operational cameras that Counter Terrorism would consider suspicious.
 Make sure SVS 4.0 can continue to detect these cases.
(Table of individual investigated drops: all entries TBD.)
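The detection-rate and signal-to-noise aggregation above can be sketched as follows, assuming alerts and annotated drops are reduced to (start, end) time intervals per video; the real evaluation also involves spatial alignment and per-configuration aggregation, which are omitted here:

```python
# A minimal sketch of the detection-rate / false-positive aggregation, under
# the simplifying assumption that temporal overlap alone decides a match.

def overlaps(alert, drop):
    """True if the alert interval intersects the annotated abandonment period."""
    return alert[0] <= drop[1] and drop[0] <= alert[1]

def evaluate_config(alerts, drops):
    detected = [d for d in drops if any(overlaps(a, d) for a in alerts)]
    false_pos = [a for a in alerts if not any(overlaps(a, d) for d in drops)]
    detection_rate = len(detected) / len(drops) if drops else 0.0
    # "Signal to noise": true alerts relative to all alerts raised.
    snr = (len(alerts) - len(false_pos)) / len(alerts) if alerts else 0.0
    return detection_rate, snr

alerts = [(150.0, 160.0), (900.0, 905.0)]   # system alerts (start, end) in sec
drops  = [(151.0, 281.0)]                   # annotated abandonment periods
print(evaluate_config(alerts, drops))       # (1.0, 0.5)
```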
General Data Collection Criteria
 Video data captured from the camera must be extracted to configure, tune, and evaluate the use case.
 The video frame rate should be at least 10 frames per second.
 Video must be in a format that can be decoded by a DirectShow filter.
 Staged events should not occur during the first 20 seconds of the video.
 Video quality should be as good as possible. The camera, network, and/or video encoding may need to be adjusted to improve quality.

Best Practices of Staging for Abandoned Object Detection
 Take steps to avoid introducing bias in the selection of camera and time of day (e.g., staging only during morning hours when pedestrian traffic is light, or selecting one camera much more frequently than others).
 When staging multiple events in succession, do not place an object in the same or a similar location as the previous object.
 Staging should be as realistic as possible. Actors should avoid behavior that would not be present under normal circumstances.
 With the exception of the actor, personnel involved in staging should remain outside the camera field of view for the duration of the test, if possible.
 After dropping an object, the actor should exit the camera field of view immediately, if possible.
 Avoid staging "impossible" scenarios (e.g., placing a bag behind a trash can so that it is invisible to the camera).

Near-Field People Search Evaluation

Near-Field People Search Profile
 Provide single-attribute and combined-attribute search capability on people features: baldness, eyeglasses, sunglasses, head color, skin tone, and texture and tri-color combo search on the torso area with a 13-color palette
 Provide an additional user interface to rank-order search results based on attribute confidence
 Provide a user interface to display the confidence values of the attributes of a search result
 Provide complementary capability integration with the Facial Recognition Engine for real-time identity alerting
 Provide a People Search Framework for adding new person attributes
 Provide a Color Calibration Tool to manually calibrate color using artificial object colors and existing scene objects, or to correct colors automatically

Suitable Camera Characteristics
• No more than a 15-degree downward angle
• Even and uniform lighting from above or slightly in front of the subject that does not produce dark shadows
• At least 30 pixels ear-to-ear when no Facial Recognition is deployed
• At least 32 pixels pupil-to-pupil when Facial Recognition (by Cognitec) is deployed
• Face is large as the subject passes through the turnstile
(Example images: a suitable view with ~40 pixels ear-to-ear; unsuitable views where the downward angle is too high, the scene is too dark, or the camera is too far away, so faces are small as subjects pass through the turnstile.)

4.0 People Search Requirements
Near frontal (required for People Search): at least one frame with both eyes and the mouth clearly visible.
Not frontal: no frames with both eyes and the mouth clearly visible.
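A sketch of a pre-screening check against these requirements; the thresholds come from the bullets above, while the function name and inputs are illustrative:

```python
# Hedged sketch: screen a camera against the stated NFPS requirements
# (<= 15 degrees downward angle, >= 30 px ear-to-ear without Facial
# Recognition, >= 32 px pupil-to-pupil with it).

def camera_suitable_for_nfps(downward_angle_deg: float,
                             ear_to_ear_px: float,
                             pupil_to_pupil_px: float = 0.0,
                             facial_recognition: bool = False) -> bool:
    if downward_angle_deg > 15:
        return False
    if facial_recognition:
        return pupil_to_pupil_px >= 32
    return ear_to_ear_px >= 30

print(camera_suitable_for_nfps(12, 40))            # True
print(camera_suitable_for_nfps(25, 40))            # False: angle too high
print(camera_suitable_for_nfps(10, 40, 28, True))  # False: pupils too close
```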
Data Set Collection Process
(Funnel diagram: MTA cameras (807) -> turnstile cameras (265) -> cameras suitable for NFPS (66) -> randomly sampled subset of cameras C1…CN. Cameras are rejected as "not a turnstile camera" or "unsuitable turnstile camera". For each sampled camera, 2-minute videos V1…VN are taken from randomly sampled time periods.)

Data Set 1 (Whole Camera Network)
Data Set 1a – Standard:
 All 66 cameras classified as "suitable"
 1 video from each camera
 Total 2 hr 12 min of footage
 Video uniformly sampled across the 24-hour time period, Mon-Sun
 Sample period 7/11/12 -> 8/7/12
Data Set 1b – Challenging:
 All 66 cameras classified as "suitable"
 1 video from each camera
 Total 2 hr 12 min of footage
 Video uniformly sampled across rush-hour periods, Mon-Fri (7am-8am, 11:30am-1pm, 4pm-6pm)
 Sample period 7/30/12 -> 8/24/12

Camera Distribution of the Suitable Camera Dataset
(Distribution chart.)

Data Set 2 (Single Cameras)
Data Set 2a: a good "suitable" camera. Data Set 2b: a marginal "unsuitable" camera.
Both data sets:
• 66 videos, 2 minutes each
• Total 2 hr 12 min of footage
• Sample period 7/11/12 -> 8/7/12

Annotation Details
 Each person whose face is visible for at least one frame is annotated.
 The following event attributes are included (defaults are in bold); a record-level sketch of this schema follows the Special Note slide below:
1. Eye Region: Sunglasses = {Yes, No, Unknown}, Eyeglasses = {Yes, No, Unknown}
2. Head Region: Bald = {Yes, No, Unknown}, Head Color = {red, green, blue, yellow, cyan, magenta, brown, beige, orange, black, white, light gray, dark gray, unknown}, Hat = {Yes, No, Unknown}
3. Mouth Region: Beard = {Yes, No, Unknown}, Moustache = {Yes, No, Unknown}
4. Skin Tone = {Dark, Medium, Light, Unknown}
5. Torso Pattern = {Plaid, Stripes, Solid, Patterned Other, Unknown}
6. Torso Color = {red, green, blue, yellow, cyan, magenta, brown, beige, orange, black, white, light gray, dark gray, unknown}
7. Large Amount of Skin in Torso = {Yes, No}
8. Gender = {Male, Female, Unknown}
9. Age = {Child (under 13), Adolescent (13-18), Young Adult (18-25), Adult (26-35), Adult (36-45), Adult (46-60), Senior (61+), Unknown}
10. Path: Passes through turnstile while approaching camera = {Yes, No}
 The following frame attributes are included:
1. Location of face, eyes, mouth, and torso (as bounding boxes)
2. Torso Visibility = {Visible, Off-Camera, Occluded}

Annotation Details (in Ground Truth Application and Schema)
(Screenshot of the Ground Truth Application.)

Handling Pose Variations
(Example images: face, both eyes, and mouth (frontal); face and both eyes, no mouth (frontal, looking down); face, single eye, and mouth (profile); face and single eye (profile, looking down); face only (turned away or features not distinguishable).)

Special Note
Annotate all people whose faces are visible for at least one frame (it is not necessary that they move toward the camera or pass through the turnstile).
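The record-level sketch of the per-person attributes promised above. Since bolding did not survive into this text, the defaults below ("Unknown" for most attributes, "Yes" for the turnstile path, "Visible" for torso visibility) are inferred from the surrounding slides; all field names are illustrative:

```python
# Illustrative sketch of the per-person ground-truth record, not the
# actual schema of the Ground Truth Application.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class PersonGroundTruth:
    sunglasses: str = "Unknown"
    eyeglasses: str = "Unknown"
    bald: str = "Unknown"
    head_color: str = "unknown"
    hat: str = "Unknown"
    beard: str = "Unknown"
    moustache: str = "Unknown"
    skin_tone: str = "Unknown"
    torso_pattern: str = "Unknown"
    torso_colors: List[str] = field(default_factory=lambda: ["unknown"])
    large_skin_in_torso: str = "No"
    gender: str = "Unknown"
    age: str = "Unknown"
    passes_turnstile_approaching: str = "Yes"   # per the Path slide below
    # Per-frame attributes: (frame_index, face_bbox, torso_visibility)
    frames: List[Tuple[int, Tuple[int, int, int, int], str]] = field(
        default_factory=list)
```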
Torso Annotation
• Torso visible: draw the box; align the top of the box with the top-most shoulder point and the bottom of the box with the waistline. Torso Visibility = "Visible" is the default; no need to set it.
• Torso not in camera view: do NOT draw a box. Instead, set the local attribute "Torso Visibility" to "Off-Camera".
• Torso occluded: do NOT draw a box. Instead, set the local attribute "Torso Visibility" to "Occluded".

Torso Color
• Specify colors in the torso region.
• Specify up to 3 colors (if there are more than 3 colors, choose the 3 most prominent).
• If there is only a very small amount of a given color, do not include it.
• Include only clothing, apparel, and items being carried. Do NOT include any color from skin tone.
(Example images, each labeled with up to three torso colors, e.g., white/light gray; blue/magenta/white; red/black/blue; black/yellow/white; white/yellow/red; white/blue.)

Torso Color – Accounting for Skin
 If a large amount of the person's skin is visible in the torso region, set the attribute "Large Amount of Skin in Torso" to "Yes".
(Example images labeled YES and NO.)

Torso Pattern
(Example images: Solid, Stripes, Plaid, and Patterned Other, annotated as Torso=Solid and Torso=Patterned.)

Path: Passes Through Turnstile While Approaching Camera
If the person is approaching the camera AND passes through the turnstile, leave the attribute value set to "Yes". If the person does not pass through the turnstile OR is not approaching the camera, set the value to "No".

Performance Test Summary
 4.0 Face Capture Analysis: guide 4.0 tuning efforts by analyzing its ability to detect faces, while also considering false positives, in a variety of configurations. Data Sets: 1a, 1b, 2a, and 2b
 Search Efficiency Comparison: compare search time between chronological ordering and rank ordering. Data Sets: 1a, 1b, 2a, and 2b

4.0 Face Capture Analysis for People Search – Track Alignment
(Timeline diagram: labeled objects in the ground truth are aligned over time with system-generated face tracks, each of which has a key frame. A labeled object aligned with a key frame counts as TP; a labeled object with no aligned track counts as FN; a track with no labeled object counts as FP. An example face track is shown, with the key frame indicated by a yellow box.)

4.0 Face Capture Analysis for People Search
$\mathrm{Recall} = \frac{|TP|}{|TP| + |FN|}$, $\mathrm{Precision} = \frac{|TP|}{|TP| + |FP|}$
(Synthetic precision/recall curve; different points are produced by varying the Face Capture sensitivity parameter. Note: synthetic results.)

Average Chronological Thumbnail Search Time
$C_w = \frac{\tau \, w \, |X|}{2}$
where:
• $w$ = window-size multiplier
• $\tau$ = time required to scan a single thumbnail (e.g., 0.5 seconds)
• $X$ = the set of people (TP) that were successfully detected and tracked by the system
(Diagram: a person of interest within chronologically ordered TP sets of size $|X|$, $2|X|$, and $3|X|$.)

Average Ranked Search Time
$R_w = \frac{\tau \, w}{|X|} \sum_{i \in X} \frac{1}{2^{n}-1} \sum_{j=1}^{2^{n}-1} \mathrm{rank}(X_i, A_j)$
where $n$ is the number of possible search filters for person $X_i$, $A_1 \ldots A_{2^n-1}$ are the non-empty filter combinations, and $\mathrm{rank}(X_i, A_j)$ is the rank position of $X_i$ in the results for combination $A_j$.
Example ranks for one person:
Search Filters | Rank Position
ST=MED | 10
EG=YES | 2
… | …
ST=MED, EG=YES, HC=YEL, TP=SOLID, TC=WHT | 13
ST=MED, EG=YES, HC=YEL, TP=SOLID, TC=WHT, HAIR=YES | 3
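Both search-time models can be sketched directly from the formulas above. The reconstruction assumes 1-based rank positions and that the inner average runs over the 2^n - 1 non-empty filter combinations; tau, w, and the example ranks are illustrative:

```python
# Sketch of the chronological (C_w) and ranked (R_w) search-time models.

def chronological_search_time(tau: float, w: float, num_tp: int) -> float:
    # C_w: on average the user scans half of a window of w * |X| thumbnails.
    return tau * w * num_tp / 2.0

def ranked_search_time(tau: float, w: float, ranks_per_person: list) -> float:
    # R_w: for each TP person X_i, average the rank position over the
    # filter combinations, then average over people and scale by tau * w.
    per_person = [sum(ranks) / len(ranks) for ranks in ranks_per_person]
    return tau * w * sum(per_person) / len(per_person)

# Example: one person with the ranks from the table above, one with made-up
# ranks; 0.5 s per thumbnail, window multiplier 1.
ranks = [[10, 2, 13, 3], [5, 1, 7, 2]]
print(chronological_search_time(0.5, 1.0, 100))  # 25.0 seconds
print(ranked_search_time(0.5, 1.0, ranks))       # 2.6875 seconds
```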
Search Efficiency Comparison
(Synthetic chart comparing chronological and ranked search times. Note: synthetic results.)

Evaluation Process Illustration
(Flow diagram: system results and ground truth are aligned, yielding recall and precision; the aligned true-positive set X feeds the comparative search step, which produces the search-time comparison results.)

Discussion and Revision Logs
Date | Description
8/28/2012 | First review with NYPD: Dr. Evan Levine, Dir. Rich Schroeder, and Sgt. Nelson Pimentel
8/29/2012 | Revised the deck based on the input from NYPD; waiting for NYPD written feedback
9/18/2012 | 1. Revised the document based on NYPD feedback (Slides 27 and 31). 2. Reviewed and finalized the changes with NYPD.

Forensic Object Search Evaluation
Evaluation profiles:
• Mid-Field People Search
• Mid-Field Vehicle Search
• Detection-Enhanced Tracking
• Outdoor Tracking (real-world metrics only)
(Example image annotated with: Class = Car; Speed = 100 px/sec (42 MPH); Size = 971 px² (79"x50"x153"); Duration = 3.1 sec; Color = Yellow.)

Evaluation Data Sets

Camera Grouping (Whole NYPD Network)
(Grouping tree over all cameras:)
• Outdoor cameras suitable for object tracking evaluation: vehicles only (suitable for Midfield Vehicle Search); vehicles and people (suitable for Outdoor Detection Enhanced Tracking); people only (suitable for Midfield People Search and for torso and legs color classification), split into "A" quality (good) and "B" quality (challenging)
• Indoor cameras suitable for object tracking evaluation: people only (suitable for Midfield People Search), split into "A" quality (good) and "B" quality (challenging)

Camera Group Breakdown
(Breakdown chart.)

Data Sampling Process
(Diagram: from all suitable camera groups G1…G10, subsets of cameras C1…CN are randomly sampled from each group; videos V1…VN are randomly sampled per camera; annotation markers M1…MN are randomly inserted into each video.)

Data Sets from Suitable-for-Evaluation Camera Groups
Each of the 10 suitable camera groups is used to generate two types of data sets.
Data Set 1: 24x7 Samples
 Randomly sample 20% of each group (166 cameras in total)
 Collect 1 video sample per camera; the video is uniformly sampled across the 24-hour time period, Mon-Sun
 Generate annotation markers for each sample video
Data Set 2: Rush Hour Samples
 Use the same 166 cameras sampled for Data Set 1
 Collect 1 video sample per camera; the video is uniformly sampled across rush-hour time periods only, Mon-Fri (7am-8am, 11:30am-1pm, 4pm-6pm)
 Generate annotation markers for each sample video
Please note: different metrics use different camera groups with both data-set types above.
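A sketch of the camera-sampling step under the scheme above (20% of each suitable group, one video per sampled camera); group names and camera IDs are illustrative:

```python
# Hedged sketch of the per-group sampling described above. The rush-hour
# windows come from the slide; the helper names are illustrative.

import random

RUSH_HOURS = [("07:00", "08:00"), ("11:30", "13:00"), ("16:00", "18:00")]  # Mon-Fri

def sample_cameras(groups: dict, fraction: float = 0.2, seed: int = 0) -> dict:
    """Randomly sample a fraction of the cameras in each suitable group."""
    rng = random.Random(seed)
    return {name: rng.sample(cams, max(1, round(len(cams) * fraction)))
            for name, cams in groups.items()}

groups = {"midfield_vehicle": [f"MV{i}" for i in range(40)],
          "midfield_people_outdoor_A": [f"MPA{i}" for i in range(25)]}
sampled = sample_cameras(groups)
print({g: len(c) for g, c in sampled.items()})
# {'midfield_vehicle': 8, 'midfield_people_outdoor_A': 5}
```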
Annotation Details

Annotation for All Objects is Time Consuming
High-activity estimation: ~1.5 hours of annotation time per minute of video
≅ 45 hours for 30 minutes of video
≅ 22.5 person-days for 30 minutes of video (assuming a person spends 2 hours per day doing nothing but annotation)
Following this approach means that only a few videos from a few cameras can be annotated in a reasonable time period, yet the general metadata indexing profiles will be deployed on a wide variety of cameras.

Single Complete Object Track Annotation
For a single video, "triangle markers" are automatically generated on randomly selected video frames. At each triangle marker, annotate the complete path of the closest object. If there are no objects in the scene, set the Object Class attribute to "None".

Single Complete Object Track Annotation Detail
Annotate the object closest to the triangle from the point that it enters the visible image until it leaves the visible image. Intermediate frames are interpolated at evaluation time (see the sketch after the Single Frame Object Annotation slides below).
Object classes:
• Sub-Compact, Compact, Sedan, Station Wagon, Limousine
• Small Jeep, Small SUV, Large SUV
• Small Pickup Truck, Large Pickup Truck
• Minivan, Normal Van, RV, Delivery Van
• Small School Bus, Large School Bus, Transit Bus, Double Decker Bus
• Motorcycle, Moped/Scooter, Bicycle
• Person, Person on Horse, Horse Carriage, Person on Skateboard, Person on Scooter, Person on Roller Blades
• Dog, Cat, Other Animal
• Other-Small, Other-Medium, Other-Large
If there are no objects in the scene, set to "None".

Single Frame Object Annotation
For a single video, "diamond markers" are automatically generated on randomly selected video frames. At each diamond marker, draw a bounding box around all objects in the image. If there are no objects in the scene, set the Objects Present attribute to "No".

Single Frame Object Annotation Detail
Draw a bounding box around all moving objects in the visible image. If there are no objects in the scene, set to "No".
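The interpolation step referenced above can be sketched as plain linear interpolation between annotated keyframe boxes; the (x, y, w, h) box format and helper name are assumptions:

```python
# Sketch: the annotator keyframes an object's bounding box only at some
# frames; intermediate frames are linearly interpolated at evaluation time.

def interpolate_box(keyframes: dict, frame: int) -> tuple:
    """Linearly interpolate a box between the two nearest annotated frames."""
    frames = sorted(keyframes)
    if frame <= frames[0]:
        return keyframes[frames[0]]
    if frame >= frames[-1]:
        return keyframes[frames[-1]]
    lo = max(f for f in frames if f <= frame)   # nearest keyframe at or before
    hi = min(f for f in frames if f >= frame)   # nearest keyframe at or after
    if lo == hi:
        return keyframes[lo]
    t = (frame - lo) / (hi - lo)
    return tuple(a + t * (b - a) for a, b in zip(keyframes[lo], keyframes[hi]))

track = {0: (100, 50, 40, 80), 30: (220, 50, 40, 80)}   # annotated keyframes
print(interpolate_box(track, 15))   # (160.0, 50.0, 40.0, 80.0)
```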
Performance Test Summary (1/3)
 Time Window Retrieval Accuracy (Unfiltered Search) Comparison
Purpose: Compare 3.6.7 City Surveillance and a variety of 4.0 profiles in terms of their ability to detect and track moving objects.
Camera Groups: Outdoor Detection Enhanced Tracking; Midfield Vehicle Search; Midfield People Search-Outdoor A, B; Midfield People Search-Indoor A, B
Metrics: Recall, Duplication Rate
4.0 Profiles Tested: Midfield Vehicle Search, Outdoor Detection Enhanced Tracking, Midfield People Search
 Retrieval Precision (Signal to Noise Ratio) Comparison
Purpose: Compare 3.6.7 City Surveillance and a variety of 4.0 profiles in terms of how frequently indexed events describe actual moving objects.
Camera Groups: Outdoor Detection Enhanced Tracking; Midfield Vehicle Search; Midfield People Search-Outdoor A, B; Midfield People Search-Indoor A, B
Metrics: Precision
4.0 Profiles Tested: Midfield Vehicle Search, Outdoor Detection Enhanced Tracking, Midfield People Search
 Object Count Accuracy (Histogram) Comparison
Purpose: Compare 3.6.7 City Surveillance and a variety of 4.0 profiles in terms of how well each estimates the number of moving objects in the scene.
Camera Groups: Outdoor Detection Enhanced Tracking; Midfield Vehicle Search; Midfield People Search-Outdoor A, B; Midfield People Search-Indoor A, B
Metrics: Object Count Accuracy
4.0 Profiles Tested: Midfield Vehicle Search, Outdoor Detection Enhanced Tracking, Midfield People Search

Performance Test Summary (2/3)
 Size, Speed (Pixels) and Duration Estimation Comparison
Purpose: Compare 3.6.7 City Surveillance and a variety of 4.0 profiles in terms of how well the size, speed, and duration of moving objects are estimated.
Camera Groups: Outdoor Detection Enhanced Tracking; Midfield Vehicle Search; Midfield People Search-Outdoor A, B; Midfield People Search-Indoor A, B
Metrics: Size, Speed, and Duration Ratios
4.0 Profiles Tested: Midfield Vehicle Search, Outdoor Detection Enhanced Tracking, Midfield People Search
 Width, Height, Length, Speed (World Metric) Estimation Evaluation
Purpose: Estimate how well the size and speed of moving objects are estimated in terms of real-world coordinates (feet, MPH) in 4.0.
Camera Groups: All
Metrics: Width, Height, Length, and Speed Ratios
4.0 Profiles Tested: Midfield Vehicle Search, Outdoor Detection Enhanced Tracking, Outdoor Tracking

Performance Test Summary (3/3)
 General Object Color and Type Classification Comparison
Purpose: Compare 3.6.7 City Surveillance and a variety of 4.0 profiles in terms of how accurately the color and type of moving objects are assigned.
Camera Groups: Outdoor Detection Enhanced Tracking, Midfield Vehicle Search
Metrics: Color and Object Type Accuracy
4.0 Profiles Tested: Outdoor Detection Enhanced Tracking, Midfield Vehicle Search (object color only)
 Midfield People: Color Retrieval Comparison
Purpose: Compare 4.0 rank ordering on different data sets ("A" and "B" quality) and against chronological ordering.
Camera Groups: Midfield People Search-Outdoor A, B; Midfield People Search-Indoor A, B
Metrics: Average Time to Find Person
4.0 Profiles Tested: Midfield People Search

Time Window Retrieval Accuracy (Unfiltered Search)
$\mathrm{Recall} = \frac{|TP|}{|TP| + |FN|}$
$\mathrm{Duplication\ Rate} = \frac{|KF|}{|TP|}$
where:
• $TP$ = set of all labeled objects aligned to at least one key frame
• $FN$ = set of all labeled objects aligned with zero key frames
• $KF$ = set of all key frames aligned to a labeled object
(Diagram: labeled ground-truth objects aligned with system-generated tracks and their key frames.)
Example alignment sets:
GT object | TP | FN | KF
1 | 1 | 0 | 1
2 | 1 | 0 | 1
3 | 1 | 0 | 3
4 | 0 | 1 | 0
… | … | … | …
|GT| | 0 | 1 | 0
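A minimal sketch of the bookkeeping behind these two metrics, with the spatio-temporal alignment test abstracted into the input (each entry is the number of key frames aligned to one labeled object):

```python
# Sketch: each labeled object counts as TP if at least one key frame aligns
# with it, FN otherwise; duplication rate is aligned key frames per detection.

def time_window_retrieval_metrics(keyframes_per_object: list):
    """keyframes_per_object[i] = number of key frames aligned to GT object i."""
    tp = sum(1 for k in keyframes_per_object if k >= 1)
    fn = sum(1 for k in keyframes_per_object if k == 0)
    kf = sum(keyframes_per_object)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    duplication_rate = kf / tp if tp else 0.0
    return recall, duplication_rate

# Objects aligned with 1, 1, 3, 0, 0 key frames, as in the table above:
print(time_window_retrieval_metrics([1, 1, 3, 0, 0]))   # (0.6, 1.666...)
```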
Detail: Time Window Retrieval Accuracy (Unfiltered Search) Evaluation
Goal: Find out how often objects annotated in triangle frames were successfully detected by the system as events. A successful detection is determined by whether or not the object was captured at least once by a track "key frame." The capture rate is expressed as the proportion of successful detections to total objects annotated (recall). This evaluation measures the effectiveness of an "unfiltered search" in the SVS UI: in an unfiltered search, all system-generated events in a specified time window are in the result set.
Note that how well the object is tracked is not considered here. On the previous slide, an object is tracked well in the first example but poorly in the second; both are treated equally, because a key frame was successfully aligned with an annotated object in each case. Poor tracking will tend to result in poor color and size estimation, etc., which are measured separately (see subsequent slides).
The second metric, duplication rate, measures how often multiple events are generated for a single moving object. For example, a duplication rate of 3 indicates that, on average, 3 events are indexed for each detected moving object. While a duplication rate of 1 is in some sense ideal, a value greater than 1 could be better from a practical standpoint, in that multiple records may help ensure that the user doesn't miss an object of interest while scanning through the results. At some point, however, the duplication rate becomes so large that this benefit is outweighed by the need to sift through many uninteresting, redundant results.

Time Window Retrieval Accuracy (Unfiltered Search) Comparison
(Synthetic charts for Midfield People Search, Outdoor Detection Enhanced Tracking, and Midfield Vehicle Search. Note: synthetic results.)

Retrieval Precision (Signal to Noise Ratio)
$\mathrm{Precision} = \frac{\sum_{i=1}^{M} |KF_i \cap GT_i|}{\sum_{i=1}^{M} |KF_i|}$
where $M$ is the number of diamond frames, $KF_i$ is the set of track key-frame rectangles on diamond frame $i$, and $KF_i \cap GT_i$ denotes the key frames on frame $i$ that intersect a ground-truth rectangle.
(Example image of system results showing ground-truth (GT) boxes, track frames (TF), and track key frames (KF).)
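The precision sum can be sketched as below, assuming (x, y, w, h) rectangles and the usual axis-aligned intersection test:

```python
# Sketch: per diamond frame, count key frames that intersect a ground-truth
# box (numerator) versus all key frames (denominator).

def intersects(a, b) -> bool:
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def retrieval_precision(diamond_frames: list) -> float:
    """diamond_frames: list of (key_frame_boxes, ground_truth_boxes) pairs."""
    hits, total = 0, 0
    for kf_boxes, gt_boxes in diamond_frames:
        hits += sum(1 for kf in kf_boxes
                    if any(intersects(kf, gt) for gt in gt_boxes))
        total += len(kf_boxes)
    return hits / total if total else 0.0

frames = [([(0, 0, 10, 10), (50, 50, 5, 5)], [(2, 2, 10, 10)]),  # 1 of 2 hit
          ([(5, 5, 4, 4)], [(6, 6, 10, 10)])]                    # 1 of 1 hit
print(retrieval_precision(frames))   # 0.666...
```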
Detail: Retrieval Precision (Signal to Noise Ratio) Evaluation
Goal: Quantify how often indexed events describe real moving objects. On each diamond frame, the set of all key-frame rectangles that intersect ground-truth rectangles is formed, and the sizes of these intersection sets are summed over all diamond frames to form the numerator. The sizes of all key-frame sets, whether they intersect ground-truth rectangles or not, are summed over all diamond frames to form the denominator. The result is the proportion of indexed events that are moving objects, as opposed to "garbage" results (e.g., key-frame images that contain empty boxes).
We compute recall and retrieval precision separately, using different annotation markers. Technically, we could compute both on diamond markers, but two considerations led us to compute TP on triangle frames:
1. It may take a lot of diamond frames to capture a reasonable number of system key frames. Every track has just one key frame, so the likelihood that a key frame will be present for a particular track on a diamond frame is small. As a result, precision may have a relatively large margin of error, depending on how many diamond frames we can annotate. Due to this issue, we want to limit the metrics that depend on key frames appearing in diamond frames.
2. In triangle frames, the annotator marks both color and object class and annotates the full object trajectory across space and time. We operate on the set of true positives (system results aligned with annotated objects in triangle frames) in the evaluation of size, speed, duration, world metrics, color, and object class. This obviously wouldn't be possible if our pool of TP came from diamond frames instead, which are just single-frame annotations without color designations.

Retrieval Precision Comparison
(Synthetic signal-to-noise-ratio chart for Midfield People Search, Outdoor Detection Enhanced Tracking, and Midfield Vehicle Search. Note: synthetic results.)

Object Count Accuracy Evaluation
$\mathrm{Count}_{ratio} = \frac{\sum_{i=1}^{M} |TF_i|}{\sum_{i=1}^{M} |GT_i|}$
where $M$ is the number of diamond frames, $TF_i$ is the set of track frames on diamond frame $i$, and $GT_i$ is the set of ground-truth rectangles on diamond frame $i$.
Note: Object count is expressed through the event-statistics histogram in the SVS UI. The count ratio is the proportion of tracked objects at a point in time to the actual number of moving objects at the same point in time.

Object Count Accuracy Comparison
(Synthetic object-count-ratio chart for Midfield People Search, Outdoor Detection Enhanced Tracking, and Midfield Vehicle Search. Note: synthetic results.)

Size, Speed and Duration Estimation Evaluation (Pixel)
$\mathrm{Size}_{ratio} = \frac{1}{M} \sum_{i=1}^{M} \mathrm{SizeRatio}_i$
$\mathrm{Speed}_{ratio} = \frac{1}{M} \sum_{i=1}^{M} \mathrm{SpeedRatio}_i$
$\mathrm{Duration}_{ratio} = \frac{1}{M} \sum_{i=1}^{M} \mathrm{DurationRatio}_i$
where $M$ is the number of system tracks that correspond to the set of all labeled objects aligned to at least one system key frame. In cases where there are multiple corresponding tracks, a ratio is computed for all values.

Detail: Size, Speed and Duration Estimation Evaluation (Pixel)
Goal: Determine how effectively objects can be found using size, speed (in pixel units), and duration (in seconds) filters. This evaluation operates on the set of true positives formed by aligning annotated objects in triangle frames with system-generated events expressed as key frames, which takes place in the Time Window Retrieval Accuracy evaluation. A ratio is formed for size, speed, and duration, in pixel units, by dividing the values found in the ground truth by the corresponding values generated by the system. Each ratio can be thought of as a "quality of estimation," where a value of 1 indicates a perfect estimate, a value less than 1 an under-estimate, and a value greater than 1 an over-estimate. Sometimes multiple tracks are indexed for a single moving object; in these cases, a ratio is computed for each of the corresponding tracks.

Size, Speed and Duration Estimation Comparison
(Synthetic charts. Note: synthetic results.)
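The ratio averaging is mechanical; a sketch, following the slide's convention of dividing the ground-truth value by the system estimate (all values illustrative):

```python
# Sketch: per corresponding track, ratio = ground truth / system estimate
# (1.0 = perfect estimate); the final metric averages over all M tracks.

def mean_quality_ratio(pairs: list) -> float:
    """pairs: (ground_truth_value, system_value) per corresponding track."""
    return sum(gt / sys for gt, sys in pairs) / len(pairs)

size_pairs     = [(971, 900), (450, 500)]     # px^2
speed_pairs    = [(100, 110)]                 # px/sec
duration_pairs = [(3.1, 2.8)]                 # seconds
print(round(mean_quality_ratio(size_pairs), 3))      # 0.989
print(round(mean_quality_ratio(speed_pairs), 3))     # 0.909
print(round(mean_quality_ratio(duration_pairs), 3))  # 1.107
```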
Width, Height, Length, Speed (World Metric) Estimation Evaluation
$\mathrm{Width}_{ratio} = \frac{1}{M} \sum_{i=1}^{M} \mathrm{WidthRatio}_i$
$\mathrm{Height}_{ratio} = \frac{1}{M} \sum_{i=1}^{M} \mathrm{HeightRatio}_i$
$\mathrm{Length}_{ratio} = \frac{1}{M} \sum_{i=1}^{M} \mathrm{LengthRatio}_i$
$\mathrm{Speed}_{ratio} = \frac{1}{M} \sum_{i=1}^{M} \mathrm{SpeedRatio}_i$
where $M$ is the number of system tracks that correspond to the set of all labeled objects aligned to at least one system key frame.
(Figure: example vehicles labeled with real-world dimensions in inches, e.g., 57"/180"/68", 72"/194"/74", 448"/102", 147".)

Detail: Width, Height, Length, and Speed (World Metric) Estimation Evaluation
Goal: Determine how effectively objects can be found using the width, height, length, and speed of moving objects in real-world coordinates (inches/centimeters, or miles per hour/kilometers per hour). The metrics are essentially identical to the metrics used for size and speed in pixel units: ratios for width, height, length, and speed are computed for each detected and tracked moving object, and the ratios are separately averaged to arrive at the final evaluation result. Also, "best" ratios are chosen when multiple events are indexed for a single detected object.
The main difference from the pixel-unit evaluation is how the ground truth expresses the true object dimensions and speed in real-world coordinates. In the example image (object type SUV, 72" x 194" x 74"), the dimensions come from the object-type category of "SUV" assigned by the annotator in the ground truth. All objects assigned the same object type are considered to have the same dimensions for the purpose of the evaluation; for each object type, the dimensions of a typical model in that class are used. For example, for the category "SUV" we might use the dimensions of a Jeep Grand Cherokee. If the length of the vehicle is known, the ground-truth speed in MPH or KPH can easily be computed.

Width, Height, Length, Speed (World Metric) Estimation Evaluation
(Synthetic charts. Note: synthetic results.)
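A sketch of the type-to-dimensions lookup and the speed derivation described above. The SUV dimensions are the ones shown in the example image; the sedan entry, the pixels-per-length input, and all names are illustrative assumptions:

```python
# Sketch: every object of a given type is assigned the dimensions of a
# typical model in that class; a known length then yields a speed in MPH.

TYPICAL_DIMENSIONS_IN = {            # (width, height, length) in inches
    "SUV":   (74, 72, 194),          # e.g., roughly a Jeep Grand Cherokee
    "Sedan": (68, 57, 180),          # illustrative entry
}

def ground_truth_speed_mph(object_type: str, pixels_per_sec: float,
                           pixels_per_length: float) -> float:
    """Derive speed from the known vehicle length: inches/sec -> miles/hour."""
    length_in = TYPICAL_DIMENSIONS_IN[object_type][2]
    inches_per_sec = pixels_per_sec * (length_in / pixels_per_length)
    return inches_per_sec * 3600 / (12 * 5280)

# An SUV spanning 150 px of its own length, moving at 100 px/sec:
print(round(ground_truth_speed_mph("SUV", 100, 150), 1))   # 7.3 MPH
```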
General Object Color and Type Classification Evaluation
$\mathrm{Quality}_{obj} = \frac{1}{M} \sum_{i=1}^{M} T_i$
$\mathrm{Quality}_{color} = \frac{1}{M} \sum_{i=1}^{M} C_i$
where $M$ is the number of system tracks that correspond to the set of all labeled objects aligned to at least one system key frame, $T_i$ is the type value for track $i$, and $C_i$ is the color value for track $i$.
(Diagram: ground-truth objects with type $T$ and colors $C_1, C_2, C_3$ matched to corresponding system tracks Track1…TrackN.)
Example results:
TP | T | C
1 | 1 | 1
2 | 1 | 0
3 | 0 | 1
4 | 0 | 0
5 | 1 | 0.5

Detail: General Object Color and Type Classification Evaluation
Goal: Determine how effectively objects can be found using the color and object-type filters. In this evaluation, the color and type assigned by the annotator for each moving object are compared against the color and type captured by the system. For each moving object that was successfully detected, a color value and a type value are computed: 1 if the color (respectively, type) is correct, 0 if it is incorrect.
In some cases, the object may be multi-colored. Both the annotator and the system can assign up to 3 colors to a single object. In these cases, the color value is the proportion of the actual colors that were captured by the system. For example, if the object is both yellow and black and the system captures only yellow, the color value is 0.5; if the system captures both yellow and black, the color value is 1. The final evaluation result is the average color and type value over all correspondences between system tracks and detected moving objects.
For color annotation of vehicles, the annotator is instructed to ignore any color associated with the windshield, wheels, or particular lighting conditions, and to consider only color on the body. Although these "extraneous" colors are part of the vehicle color in a literal sense, we think it would be unusual for a person to include them when describing a vehicle.

General Object Color and Type Classification Comparison
(Synthetic charts: color accuracy for Outdoor Detection Enhanced Tracking and Midfield Vehicle Search; object-type accuracy for Outdoor Detection Enhanced Tracking. Note: synthetic results.)

Midfield People: Color Retrieval Evaluation
$CS_w = \frac{\tau \, w}{|X|} \sum_{i \in X} \frac{1}{2^{n}-1} \sum_{j=1}^{2^{n}-1} \mathrm{rank}(X_i, A_j)$
where $X$ is the set of true positives, $n$ is the number of possible search filters for person $X_i$, and $\mathrm{rank}(X_i, A_j)$ is the rank position of $X_i$ under filter combination $A_j$.
Example ranks for one person:
Search Filters | Rank Position
Torso=Blue | 8
Pants=Black | 2
… | …
Torso=Blue, Torso=Black | 41
Torso=Blue, Torso=Black, Pants=Black | 3

Midfield People: Color Retrieval Comparison
(Synthetic chart. Note: synthetic results.)

Evaluation Process Illustration
(Flow diagram: the suitable camera/video groups (indoor people-only MPS "A" and "B" quality; outdoor people-only MPS "A" and "B" quality; outdoor vehicles-only, suitable and not suitable for MVS; outdoor people-and-vehicles, suitable and not suitable for ODET) are processed by City Surveillance (3.6.7), Outdoor Tracking (4.0), Midfield People Search (4.0), Outdoor Detection Enhanced Tracking (4.0), and Midfield Vehicle Search (4.0); the resulting events are evaluated against ground truth to produce the performance results.)

Discussion and Revision Logs
Date | Description
9/18/2012 | First review with NYPD: Dr. Evan Levine, Dir. Rich Schroeder, and Sgt. Nelson Pimentel
9/21/2012 | Revised the deck to include suggestions from the 9/18 face-to-face review meeting. Waiting for further NYPD review and comments.
10/10/2012 | Received feedback from NYPD on 9/25 and made a best effort to revise the presentation and respond to their questions. A face-to-face review of the changes is scheduled for 10/11.
10/16/2012 | Removed the counting histogram per Dr. Levine's suggestion. Changed the "best match"-based metrics and the associated slides: 73, 74, 76, 79, and 80.

End of the Plan