This README.txt file was generated on 4th August 2021 by Li Tsz On

-------------------
GENERAL INFORMATION
-------------------

1. Title of Dataset: Supporting data for "Automatic and Efficient Privacy Preserving and Fault Detection Techniques for Big-data Systems"

2. Author Information

   First and Corresponding Author Contact Information
      Name: Li Tsz On
      Faculty: Faculty of Engineering
      Email: u3518490@connect.hku.hk

---------------------
DATA & FILE OVERVIEW
---------------------

Directory of Files:

   A. Filename: upa_error.py
      Short description: UPA’s and FLEX’s RMSE between computed sensitivities (log-scale) and the ground truth.

   B. Filename: upa_performanceScalability.py
      Short description: UPA’s performance scalability to dataset sizes.

   C. Filename: upa_overhead.py
      Short description: UPA’s execution time normalized to vanilla Spark.

   D. Filename: upa_performanceSampleScalability.py
      Short description: UPA’s performance scalability to sample size.

   E. Filename: gupa_error.py
      Short description: GUPA’s and FLEX’s ME in enforcing ε1-gDP and ε200-gDP.

   F. Filename: gupa_performance.py
      Short description: GUPA and UPA’s execution time normalized to vanilla Spark.

   G. Filename: gupa_performanceDistanceScalability.py
      Short description: GUPA’s performance scalability to distance value k.

   H. Filename: gupa_performanceDataScalability.py
      Short description: GUPA’s performance scalability to dataset sizes.

   I. Filename: gupa_errorSensitivity.xlsx
      Short description: The local sensitivity value inferred by GUPA and FLEX.

   J. Filename: gupa_performanceBreakdown.xlsx
      Short description: Breakdown of GUPA’s execution time.

   K. Filename: themis_faultDetectionCapability.png
      Short description: Correlation between the number of faults identified by DLS testing and the error rate of a DLS.

   L. Filename: themis_numberOfFaults.png
      Short description: Number of faults detected by Themis and baselines.

   M. Filename: themis_retrainAccuracy.png
      Short description: Increase in DNN’s accuracy after retraining the DNN with faults detected by DLS testing.

   N. Filename: themis_performanceBreakdown.png
      Short description: Breakdown of Themis’s execution time.

   O. Filename: themis_correlationMean.py
      Short description: The mean of fault-occurring probabilities (FoP) inferred by Themis for each evaluated DNN.

   P. Filename: themis_retrainDiversity.py
      Short description: Detected faults’ diversity (the higher the better).

   Q. Filename: themis_sensitivitySample.py
      Short description: Themis’s coverage values for different sample sizes.

   R. Filename: themis_sensitivityThreshold.py
      Short description: Themis’s coverage values for different threshold values.

   S. Filename: themis_performanceE2e.py
      Short description: Average time taken by DLS testing to complete testing (i.e., achieve 100% test coverage).

File Naming Convention:

   All filenames are in the format "A_B.C".
   "A" is the work introduced in Chapter 3 to Chapter 5 of the thesis (i.e., "upa", "gupa", or "themis").
   "B" is the content of the data (e.g., "upa_performanceScalability.py" is about UPA's performance scalability).
   "C" is the file's format: ".py" is a Python script, ".xlsx" is an Excel file, and ".png" is an image file.

-----------------------------------------
DATA DESCRIPTION FOR: (A) upa_error.py
-----------------------------------------

1. Number of variables: 2
2. Number of cases/rows: 18
3. Missing data codes: n/a
4. Variable List

   A. Name: System names
      Description: Systems for evaluation ("UPA" and "FLEX").

   B. Name: Evaluated queries
      Description: Big-data queries for evaluation ("TPCH1", "TPCH4", "TPCH13", "TPCH16", "TPCH21", "KMEANS", "LR", "TPCH6", "TPCH11").

-----------------------------------------
DATA DESCRIPTION FOR: (B) upa_performanceScalability.py
-----------------------------------------

1. Number of variables: 2
2. Number of cases/rows: 90
3. Missing data codes: n/a
4. Variable List

   A. Name: Evaluated queries
      Description: Big-data queries for evaluation ("TPCH1", "TPCH4", "TPCH13", "TPCH16", "TPCH21", "KMEANS", "LR", "TPCH6", "TPCH11").

   B. Name: Dataset sizes
      Description: UPA's dataset sizes ('1M', '2M', '3M', '4M', '5M', '6M', '7M', '8M', '9M', '10M').

-----------------------------------------
DATA DESCRIPTION FOR: (C) upa_overhead.py
-----------------------------------------

1. Number of variables: 2
2. Number of cases/rows: 18
3. Missing data codes: n/a
4. Variable List

   A. Name: System names
      Description: Systems for evaluation ("UPA" and "FLEX").

   B. Name: Evaluated queries
      Description: Big-data queries for evaluation ("TPCH1", "TPCH4", "TPCH13", "TPCH16", "TPCH21", "KMEANS", "LR", "TPCH6", "TPCH11").

-----------------------------------------
DATA DESCRIPTION FOR: (D) upa_performanceSampleScalability.py
-----------------------------------------

1. Number of variables: 2
2. Number of cases/rows: 36
3. Missing data codes: n/a
4. Variable List

   A. Name: Evaluated queries
      Description: Big-data queries for evaluation ("TPCH1", "TPCH4", "TPCH13", "TPCH16", "TPCH21", "KMEANS", "LR", "TPCH6", "TPCH11").

   B. Name: Sample sizes
      Description: UPA's sample sizes ('$10^2$', '$10^3$', '$10^4$', '$10^5$').

-----------------------------------------
DATA DESCRIPTION FOR: (E) gupa_error.py
-----------------------------------------

1. Number of variables: 2
2. Number of cases/rows: 36
3. Missing data codes: n/a
4. Variable List

   A. Name: System names
      Description: Systems for evaluation ("UPA-$\epsilon_{1}$", "FLEX-$\epsilon_{1}$", "UPA-$\epsilon_{200}$", "FLEX-$\epsilon_{200}$").

   B. Name: Evaluated queries
      Description: Big-data queries for evaluation ("TPCH1", "TPCH4", "TPCH13", "TPCH16", "TPCH21", "KMEANS", "LR", "TPCH6", "TPCH11").

-----------------------------------------
DATA DESCRIPTION FOR: (F) gupa_performance.py
-----------------------------------------

1. Number of variables: 2
2. Number of cases/rows: 18
3. Missing data codes: n/a
4. Variable List

   A. Name: System names
      Description: Systems for evaluation ("UPA-$\epsilon_{200}$", "gUPA-$\epsilon_{200}$").

   B. Name: Evaluated queries
      Description: Big-data queries for evaluation ("TPCH1", "TPCH4", "TPCH13", "TPCH16", "TPCH21", "KMEANS", "LR", "TPCH6", "TPCH11").

-----------------------------------------
DATA DESCRIPTION FOR: (G) gupa_performanceDistanceScalability.py
-----------------------------------------

1. Number of variables: 2
2. Number of cases/rows: 36
3. Missing data codes: n/a
4. Variable List

   A. Name: Evaluated queries
      Description: Big-data queries for evaluation ("TPCH1", "TPCH4", "TPCH13", "TPCH16", "TPCH21", "KMEANS", "LR", "TPCH6", "TPCH11").

   B. Name: Distance value
      Description: GUPA's distance value ('0', '1', '10', '100', '1000', '10000').

-----------------------------------------
DATA DESCRIPTION FOR: (H) gupa_performanceDataScalability.py
-----------------------------------------

1. Number of variables: 2
2. Number of cases/rows: 36
3. Missing data codes: n/a
4. Variable List

   A. Name: Evaluated queries
      Description: Big-data queries for evaluation ("TPCH1", "TPCH4", "TPCH13", "TPCH16", "TPCH21", "KMEANS", "LR", "TPCH6", "TPCH11").

   B. Name: Dataset size
      Description: Dataset size (TB) of each query ('0.1', '0.2', '0.4', '0.6', '0.8', '1.0').

-----------------------------------------
DATA DESCRIPTION FOR: (I) gupa_errorSensitivity.xlsx
-----------------------------------------

1. Number of variables: 4
2. Number of cases/rows: 63
3. Missing data codes: n/a
4. Variable List

   A. Name: Evaluated queries
      Description: Big-data queries for evaluation ("TPCH1", "TPCH4", "TPCH13", "TPCH16", "TPCH21", "KMEANS", "LR", "TPCH6", "TPCH11").

   B. Name: Output value
      Description: GUPA's output value for each query.

   C. Name: Local sensitivity at Distance 1
      Description: Local sensitivity values at Distance 1 inferred by FLEX and UPA, and the actual value.

   D. Name: Local sensitivity at Distance 200
      Description: Local sensitivity values at Distance 200 inferred by FLEX and UPA, and the actual value.

-----------------------------------------
DATA DESCRIPTION FOR: (J) gupa_performanceBreakdown.xlsx
-----------------------------------------

1. Number of variables: 7
2. Number of cases/rows: 54
3. Missing data codes: n/a
4. Variable List

   A. Name: Evaluated queries
      Description: Big-data queries for evaluation ("TPCH1", "TPCH4", "TPCH13", "TPCH16", "TPCH21", "KMEANS", "LR", "TPCH6", "TPCH11").

   B. Name: Sampling
      Description: GUPA's execution time in sampling.

   C. Name: Computing non-sampled input
      Description: GUPA's execution time in computing non-sampled input.

   D. Name: VRR and HRR
      Description: GUPA's execution time in VRR and HRR.

   E. Name: Overall execution time to enforce DP
      Description: GUPA's overall execution time to enforce DP.

   F. Name: Original execution time for a query
      Description: GUPA's original execution time for a query.

-----------------------------------------
DATA DESCRIPTION FOR: (K) themis_faultDetectionCapability.png
-----------------------------------------

1. Number of variables: 4
2. Number of cases/rows: 150
3. Missing data codes: n/a
4. Variable List

   A. Name: Fault detection capability
      Description: The fault detection capability of each evaluated system.

   B. Name: Adversarial attacks
      Description: Adversarial attacks applied on input images ("CW", "FGSM", "PGD", "GAUSSIAN").

   C. Name: Systems
      Description: Deep learning testing systems for evaluation ("Themis", "DeepXplore", "DeepGauge", "DeepImportance", "Surprise Adequacy").

   D. Name: Datasets and models
      Description: Datasets and models for evaluation ("MNIST (LeNet-1)", "MNIST (LeNet-4)", "MNIST (LeNet-5)", "Contagio (<200,200>)", "Drebin (<200, 10>)", "ImageNet (VGG-19)", "ImageNet (ResNet-50)", "Udacity (DAVE-2)", "Cifar10 (ResNet56)", "Cifar10 (DenseNet121)").

-----------------------------------------
DATA DESCRIPTION FOR: (L) themis_numberOfFaults.png
-----------------------------------------

1. Number of variables: 4
2. Number of cases/rows: 150
3. Missing data codes: n/a
4. Variable List

   A. Name: Number of faults
      Description: The number of faults detected by each system.

   B. Name: Adversarial attacks
      Description: Adversarial attacks applied on input images ("CW", "FGSM", "PGD", "GAUSSIAN").

   C. Name: Systems
      Description: Deep learning testing systems for evaluation ("Themis", "DeepXplore", "DeepGauge", "DeepImportance", "Surprise Adequacy").

   D. Name: Datasets and models
      Description: Datasets and models for evaluation ("MNIST (LeNet-1)", "MNIST (LeNet-4)", "MNIST (LeNet-5)", "Contagio (<200,200>)", "Drebin (<200, 10>)", "ImageNet (VGG-19)", "ImageNet (ResNet-50)", "Udacity (DAVE-2)", "Cifar10 (ResNet56)", "Cifar10 (DenseNet121)").

-----------------------------------------
DATA DESCRIPTION FOR: (M) themis_retrainAccuracy.png
-----------------------------------------

1. Number of variables: 4
2. Number of cases/rows: 150
3. Missing data codes: n/a
4. Variable List

   A. Name: Retrain accuracy
      Description: Improvement in a deep learning model's accuracy after retraining with the faults detected by each testing system.

   B. Name: Adversarial attacks
      Description: Adversarial attacks applied on input images ("CW", "FGSM", "PGD", "GAUSSIAN").

   C. Name: Systems
      Description: Deep learning testing systems for evaluation ("Themis", "DeepXplore", "DeepGauge", "DeepImportance", "Surprise Adequacy").

   D. Name: Datasets and models
      Description: Datasets and models for evaluation ("MNIST (LeNet-1)", "MNIST (LeNet-4)", "MNIST (LeNet-5)", "Contagio (<200,200>)", "Drebin (<200, 10>)", "ImageNet (VGG-19)", "ImageNet (ResNet-50)", "Udacity (DAVE-2)", "Cifar10 (ResNet56)", "Cifar10 (DenseNet121)").

-----------------------------------------
DATA DESCRIPTION FOR: (N) themis_performanceBreakdown.png
-----------------------------------------

1. Number of variables: 3
2. Number of cases/rows: 80
3. Missing data codes: n/a
4. Variable List

   A. Name: Execution time
      Description: Execution time of Themis components ("Total Test Time", "FoP Calculator", "FoP Estimator", "FoP Fuzzer (iterations)").

   B. Name: FoP Sampler
      Description: With or without FoP Sampler.

   C. Name: Datasets and models
      Description: Datasets and models for evaluation ("MNIST (LeNet-1)", "MNIST (LeNet-4)", "MNIST (LeNet-5)", "Contagio (<200,200>)", "Drebin (<200, 10>)", "ImageNet (VGG-19)", "ImageNet (ResNet-50)", "Udacity (DAVE-2)", "Cifar10 (ResNet56)", "Cifar10 (DenseNet121)").

-----------------------------------------
DATA DESCRIPTION FOR: (O) themis_correlationMean.py
-----------------------------------------

1. Number of variables: 2
2. Number of cases/rows: 40
3. Missing data codes: n/a
4. Variable List

   A. Name: Adversarial attacks
      Description: Adversarial attacks on datasets ("CW", "FGSM", "PGD", "GAUSSIAN").

   B. Name: Datasets and models
      Description: Datasets and models for evaluation ("MNIST (LeNet-1)", "MNIST (LeNet-4)", "MNIST (LeNet-5)", "Contagio (<200,200>)", "Drebin (<200, 10>)", "ImageNet (VGG-19)", "ImageNet (ResNet-50)", "Udacity (DAVE-2)", "Cifar10 (ResNet56)", "Cifar10 (DenseNet121)").

-----------------------------------------
DATA DESCRIPTION FOR: (P) themis_retrainDiversity.py
-----------------------------------------

1. Number of variables: 2
2. Number of cases/rows: 50
3. Missing data codes: n/a
4. Variable List

   A. Name: Systems
      Description: Deep learning testing systems ("Themis", "DeepXplore", "DeepGauge", "DeepImportance", "Surprise Adequacy for DL systems").

   B. Name: Datasets and models
      Description: Datasets and models for evaluation ("MNIST (LeNet-1)", "MNIST (LeNet-4)", "MNIST (LeNet-5)", "Contagio (<200,200>)", "Drebin (<200, 10>)", "ImageNet (VGG-19)", "ImageNet (ResNet-50)", "Udacity (DAVE-2)", "Cifar10 (ResNet56)", "Cifar10 (DenseNet121)").

-----------------------------------------
DATA DESCRIPTION FOR: (Q) themis_sensitivitySample.py
-----------------------------------------

1. Number of variables: 2
2. Number of cases/rows: 50
3. Missing data codes: n/a
4. Variable List

   A. Name: Sample size
      Description: Themis's sample size ("10^{1}", "10^{2}", "10^{3}", "10^{4}", "10^{5}").

   B. Name: Datasets and models
      Description: Datasets and models for evaluation ("MNIST (LeNet-1)", "MNIST (LeNet-4)", "MNIST (LeNet-5)", "Contagio (<200,200>)", "Drebin (<200, 10>)", "ImageNet (VGG-19)", "ImageNet (ResNet-50)", "Udacity (DAVE-2)", "Cifar10 (ResNet56)", "Cifar10 (DenseNet121)").

-----------------------------------------
DATA DESCRIPTION FOR: (R) themis_sensitivityThreshold.py
-----------------------------------------

1. Number of variables: 2
2. Number of cases/rows: 50
3. Missing data codes: n/a
4. Variable List

   A. Name: Threshold values
      Description: Themis's threshold value for MCMC ("0.00", "0.02", "0.04", "0.06", "0.08").

   B. Name: Datasets and models
      Description: Datasets and models for evaluation ("MNIST (LeNet-1)", "MNIST (LeNet-4)", "MNIST (LeNet-5)", "Contagio (<200,200>)", "Drebin (<200, 10>)", "ImageNet (VGG-19)", "ImageNet (ResNet-50)", "Udacity (DAVE-2)", "Cifar10 (ResNet56)", "Cifar10 (DenseNet121)").

-----------------------------------------
DATA DESCRIPTION FOR: (S) themis_performanceE2e.py
-----------------------------------------

1. Number of variables: 2
2. Number of cases/rows: 50
3. Missing data codes: n/a
4. Variable List

   A. Name: Systems
      Description: Deep learning testing systems for evaluation ("Themis", "DeepXplore", "DeepGauge", "DeepImportance", "Surprise Adequacy").

   B. Name: Datasets and models
      Description: Datasets and models for evaluation ("MNIST (LeNet-1)", "MNIST (LeNet-4)", "MNIST (LeNet-5)", "Contagio (<200,200>)", "Drebin (<200, 10>)", "ImageNet (VGG-19)", "ImageNet (ResNet-50)", "Udacity (DAVE-2)", "Cifar10 (ResNet56)", "Cifar10 (DenseNet121)").

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

Software-specific information:

   Name: Python
   Version: 3.7
   System Requirements: n/a
   Open Source? (Y/N): Y
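
Example of reading the file naming convention: the "A_B.C" format described under File Naming Convention can be checked mechanically. The following is a minimal sketch only, not part of the dataset. The function name parse_filename and the two lookup tables are hypothetical and introduced here for illustration; the prefix "themis" is taken from the filenames in the Directory of Files.

```python
import os

# Hypothetical lookup tables built from this README:
# prefix "A" -> work, extension "C" -> file type.
WORKS = {"upa": "UPA", "gupa": "GUPA", "themis": "Themis"}
FORMATS = {".py": "Python script", ".xlsx": "Excel file", ".png": "image file"}


def parse_filename(filename):
    """Split an "A_B.C" filename into (work, content, file type)."""
    stem, ext = os.path.splitext(filename)      # "A_B", ".C"
    prefix, sep, content = stem.partition("_")  # "A", "_", "B"
    if not sep or prefix not in WORKS or ext not in FORMATS:
        raise ValueError("not in A_B.C format: %r" % filename)
    return WORKS[prefix], content, FORMATS[ext]


if __name__ == "__main__":
    print(parse_filename("upa_performanceScalability.py"))
```

Under these assumptions, "upa_performanceScalability.py" maps to the work "UPA", the content "performanceScalability", and the file type "Python script". A name that does not follow the convention, such as "README.txt", raises ValueError.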