This README.txt file was generated on 4th August 2021 by Li Tsz On

-------------------
GENERAL INFORMATION
-------------------

1. Title of Dataset: Supporting data for "Automatic and Efficient Privacy Preserving and Fault Detection Techniques for Big-data Systems"

2. Author Information

First and Corresponding Author Contact Information
    Name: Li Tsz On
    Faculty: Faculty of Engineering
    Email: u3518490@connect.hku.hk

---------------------
DATA & FILE OVERVIEW
---------------------

Directory of Files:
   A. Filename: upa_error.py     
      Short description: UPA’s and FLEX’s RMSE between computed sensitivities (log-scale) and the ground truth.     
        
   B. Filename: upa_performanceScalability.py       
      Short description: UPA’s performance scalability to dataset sizes.

   C. Filename: upa_overhead.py       
      Short description: UPA’s execution time normalized to vanilla Spark.        

   D. Filename: upa_performanceSampleScalability.py
      Short description: UPA’s performance scalability to sample size.

   E. Filename: gupa_error.py
      Short description: GUPA’s and FLEX’s ME in enforcing ε1-gDP and ε200-gDP.

   F. Filename: gupa_performance.py   
      Short description: GUPA and UPA’s execution time normalized to vanilla Spark. 

   G. Filename: gupa_performanceDistanceScalability.py   
      Short description: GUPA’s performance scalability to distance value k.   

   H. Filename: gupa_performanceDataScalability.py   
      Short description: GUPA’s performance scalability to dataset sizes.  

   I. Filename: gupa_errorSensitivity.xlsx   
      Short description: The local sensitivity value inferred by GUPA and FLEX. 

   J. Filename: gupa_performanceBreakdown.xlsx
      Short description: Breakdown of GUPA’s execution time.
    
   K. Filename: themis_faultDetectionCapability.png
      Short description: Correlation between the number of faults identified by DLS testing and the error rate of a DLS. 

   L. Filename: themis_numberOfFaults.png
      Short description: Number of faults detected by Themis and baselines.  

   M. Filename: themis_retrainAccuracy.png
      Short description: Increase in DNN’s accuracy after retraining the DNN with faults detected by DLS testing.   

   N. Filename: themis_performanceBreakdown.png
      Short description: Breakdown of Themis’s execution time by component, with and without the FoP Sampler. 

   O. Filename: themis_correlationMean.py
      Short description: The mean of fault-occurring probabilities (FoP) inferred by Themis for each evaluated DNN.

   P. Filename: themis_retrainDiversity.py
      Short description: Detected faults’ diversity (the higher the better).

   Q. Filename: themis_sensitivitySample.py
      Short description: Themis’s coverage values for different sample sizes.

   R. Filename: themis_sensitivityThreshold.py
      Short description: Themis’s coverage values for different threshold values.
   
   S. Filename: themis_performanceE2e.py
      Short description: Average time taken by DLS testing to complete testing (i.e., achieve 100% test coverage).             

File Naming Convention: All filenames are in the format "A_B.C". "A" is the work introduced in Chapters 3 to 5 of the thesis (i.e., "upa", "gupa", or "themis"). "B" is the content of the data (e.g., "upa_performanceScalability.py" is about UPA's performance scalability). "C" is the file's format: "py" is a Python script, "xlsx" is an Excel file, and "png" is an image file. 
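
The naming convention above can be checked mechanically. The following is a minimal sketch, not part of the dataset: the function name parse_filename and the lookup tables are hypothetical, and the work prefixes are taken from the Directory of Files.

```python
from pathlib import Path

# Work prefixes ("A") as they appear in the Directory of Files.
WORKS = {"upa", "gupa", "themis"}
# File formats ("C") named in the convention.
FORMATS = {".py": "Python script", ".xlsx": "Excel file", ".png": "image file"}


def parse_filename(filename):
    """Split a name of the form "A_B.C" into (work, content, format)."""
    path = Path(filename)
    # Split the stem at the first underscore only.
    work, _, content = path.stem.partition("_")
    if work not in WORKS or not content or path.suffix not in FORMATS:
        raise ValueError(f"not in A_B.C format: {filename}")
    return work, content, FORMATS[path.suffix]


print(parse_filename("upa_performanceScalability.py"))
```

A name that does not start with one of the three work prefixes, or that uses another extension, raises ValueError.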

-----------------------------------------
DATA DESCRIPTION FOR: (A) upa_error.py 
-----------------------------------------


1. Number of variables: 2 


2. Number of cases/rows: 18


3. Missing data codes: n/a


4. Variable List

    A. Name: System names
       Description: Systems for evaluation ("UPA" and "FLEX").


    B. Name: Evaluated queries
       Description: Big-data queries for evaluation ("TPCH1", "TPCH4", "TPCH13", "TPCH16","TPCH21","KMEANS", "LR", "TPCH6","TPCH11").
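
The case count for this file appears consistent with one case per (system, query) pair: 2 systems x 9 queries = 18. The sketch below enumerates the pairs under that assumption; the variable names are illustrative and do not come from upa_error.py.

```python
from itertools import product

# Values copied from the variable list above.
systems = ["UPA", "FLEX"]
queries = ["TPCH1", "TPCH4", "TPCH13", "TPCH16", "TPCH21",
           "KMEANS", "LR", "TPCH6", "TPCH11"]

# Assumption: each case is one (system, query) combination.
cases = list(product(systems, queries))
print(len(cases))
```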

-----------------------------------------
DATA DESCRIPTION FOR: (B) upa_performanceScalability.py 
-----------------------------------------


1. Number of variables: 2 


2. Number of cases/rows: 90


3. Missing data codes: n/a


4. Variable List

    A. Name: Evaluated queries
       Description: Big-data queries for evaluation ("TPCH1", "TPCH4", "TPCH13", "TPCH16","TPCH21","KMEANS", "LR", "TPCH6","TPCH11").


    B. Name: Dataset sizes
       Description: UPA's dataset sizes ('1M','2M','3M','4M','5M','6M','7M','8M','9M','10M').

-----------------------------------------
DATA DESCRIPTION FOR: (C) upa_overhead.py  
-----------------------------------------


1. Number of variables: 2 


2. Number of cases/rows: 18


3. Missing data codes: n/a


4. Variable List

    A. Name: System names
       Description: Systems for evaluation ("UPA" and "FLEX").


    B. Name: Evaluated queries
       Description: Big-data queries for evaluation ("TPCH1", "TPCH4", "TPCH13", "TPCH16","TPCH21","KMEANS", "LR", "TPCH6","TPCH11").

-----------------------------------------
DATA DESCRIPTION FOR: (D) upa_performanceSampleScalability.py  
-----------------------------------------


1. Number of variables: 2 


2. Number of cases/rows: 36


3. Missing data codes: n/a


4. Variable List

    A. Name: Evaluated queries
       Description: Big-data queries for evaluation ("TPCH1", "TPCH4", "TPCH13", "TPCH16","TPCH21","KMEANS", "LR", "TPCH6","TPCH11").


    B. Name: Sample sizes
       Description: UPA's sample sizes ('$10^2$','$10^3$','$10^4$','$10^5$').

-----------------------------------------
DATA DESCRIPTION FOR: (E) gupa_error.py
-----------------------------------------


1. Number of variables: 2 


2. Number of cases/rows: 36


3. Missing data codes: n/a


4. Variable List

    A. Name: System names
       Description: Systems for evaluation ("UPA-$\epsilon_{1}$","FLEX-$\epsilon_{1}$", "UPA-$\epsilon_{200}$","FLEX-$\epsilon_{200}$").


    B. Name: Evaluated queries
       Description: Big-data queries for evaluation ("TPCH1", "TPCH4", "TPCH13", "TPCH16","TPCH21","KMEANS", "LR", "TPCH6","TPCH11").

-----------------------------------------
DATA DESCRIPTION FOR: (F) gupa_performance.py
-----------------------------------------


1. Number of variables: 2 


2. Number of cases/rows: 18


3. Missing data codes: n/a


4. Variable List

    A. Name: System names
       Description: Systems for evaluation ("UPA-$\epsilon_{200}$", "gUPA-$\epsilon_{200}$").


    B. Name: Evaluated queries
       Description: Big-data queries for evaluation ("TPCH1", "TPCH4", "TPCH13", "TPCH16","TPCH21","KMEANS", "LR", "TPCH6","TPCH11").

-----------------------------------------
DATA DESCRIPTION FOR: (G) gupa_performanceDistanceScalability.py
-----------------------------------------


1. Number of variables: 2 


2. Number of cases/rows: 36


3. Missing data codes: n/a


4. Variable List

    A. Name: Evaluated queries
       Description: Big-data queries for evaluation ("TPCH1", "TPCH4", "TPCH13", "TPCH16","TPCH21","KMEANS", "LR", "TPCH6","TPCH11").


    B. Name: Distance Value
       Description: GUPA's distance value ('0','1','10','100','1000','10000').

-----------------------------------------
DATA DESCRIPTION FOR: (H) gupa_performanceDataScalability.py
-----------------------------------------


1. Number of variables: 2 


2. Number of cases/rows: 36


3. Missing data codes: n/a


4. Variable List

    A. Name: Evaluated queries
       Description: Big-data queries for evaluation ("TPCH1", "TPCH4", "TPCH13", "TPCH16","TPCH21","KMEANS", "LR", "TPCH6","TPCH11").


    B. Name: Dataset size
       Description: Dataset size (TB) of each query ('0.1','0.2','0.4','0.6','0.8','1.0').

-----------------------------------------
DATA DESCRIPTION FOR: (I) gupa_errorSensitivity.xlsx
-----------------------------------------


1. Number of variables: 4 


2. Number of cases/rows: 63


3. Missing data codes: n/a


4. Variable List

    A. Name: Evaluated queries
       Description: Big-data queries for evaluation ("TPCH1", "TPCH4", "TPCH13", "TPCH16","TPCH21","KMEANS", "LR", "TPCH6","TPCH11").

    B. Name: Output Value
       Description: GUPA's output value for each query.

    C. Name: Local sensitivity at Distance 1
       Description: Local Sensitivity Values at Distance 1 inferred by FLEX, UPA and the actual value.

    D. Name: Local sensitivity at Distance 200
       Description: Local Sensitivity Values at Distance 200 inferred by FLEX, UPA and the actual value.

-----------------------------------------
DATA DESCRIPTION FOR: (J) gupa_performanceBreakdown.xlsx
-----------------------------------------


1. Number of variables: 7 


2. Number of cases/rows: 54


3. Missing data codes: n/a


4. Variable List

    A. Name: Evaluated queries
       Description: Big-data queries for evaluation ("TPCH1", "TPCH4", "TPCH13", "TPCH16","TPCH21","KMEANS", "LR", "TPCH6","TPCH11").

    B. Name: Sampling
       Description: GUPA's execution time in sampling.

    C. Name: Computing non-sampled input
       Description: GUPA's execution time in computing non-sampled input.

    D. Name: VRR and HRR
       Description: GUPA's execution time in VRR and HRR.

    E. Name: Overall execution time to enforce DP
       Description: GUPA's overall execution time to enforce DP.

    F. Name: Original execution time for a query
       Description: GUPA's original execution time for a query.

-----------------------------------------
DATA DESCRIPTION FOR: (K) themis_faultDetectionCapability.png
-----------------------------------------


1. Number of variables: 4 


2. Number of cases/rows: 150


3. Missing data codes: n/a


4. Variable List

    A. Name: Fault Detection Capability
       Description: The Fault Detection Capability of each evaluated system.

    B. Name: Adversarial attacks 
       Description: Adversarial attacks applied on input images ("CW", "FGSM", "PGD", "GAUSSIAN").

    C. Name: Systems
       Description: Deep Learning Testing Systems for evaluation ("Themis", "DeepXplore", "DeepGauge", "DeepImportance","Surprise Adequacy").

    D. Name: Datasets and models
       Description: Datasets and models for evaluation ("MNIST (LeNet-1)","MNIST (LeNet-4)","MNIST (LeNet-5)","Contagio (<200,200>)","Drebin (<200, 10>)","ImageNet (VGG-19)","ImageNet (ResNet-50)","Udacity (DAVE-2)","Cifar10 (ResNet56)","Cifar10 (DenseNet121)").

-----------------------------------------
DATA DESCRIPTION FOR: (L) themis_numberOfFaults.png
-----------------------------------------


1. Number of variables: 4 


2. Number of cases/rows: 150


3. Missing data codes: n/a


4. Variable List

    A. Name: Number of Faults 
       Description: The number of faults detected by each system.

    B. Name: Adversarial attacks 
       Description: Adversarial attacks applied on input images ("CW", "FGSM", "PGD", "GAUSSIAN").

    C. Name: Systems
       Description: Deep Learning Testing Systems for evaluation ("Themis", "DeepXplore", "DeepGauge", "DeepImportance","Surprise Adequacy").

    D. Name: Datasets and models
       Description: Datasets and models for evaluation ("MNIST (LeNet-1)","MNIST (LeNet-4)","MNIST (LeNet-5)","Contagio (<200,200>)","Drebin (<200, 10>)","ImageNet (VGG-19)","ImageNet (ResNet-50)","Udacity (DAVE-2)","Cifar10 (ResNet56)","Cifar10 (DenseNet121)").

-----------------------------------------
DATA DESCRIPTION FOR: (M) themis_retrainAccuracy.png
-----------------------------------------


1. Number of variables: 4 


2. Number of cases/rows: 150


3. Missing data codes: n/a


4. Variable List

    A. Name: Retrain Accuracy 
       Description: Improvement in a deep learning model's accuracy after retraining it with the faults detected by each testing system.

    B. Name: Adversarial attacks 
       Description: Adversarial attacks applied on input images ("CW", "FGSM", "PGD", "GAUSSIAN").

    C. Name: Systems
       Description: Deep Learning Testing Systems for evaluation ("Themis", "DeepXplore", "DeepGauge", "DeepImportance","Surprise Adequacy").

    D. Name: Datasets and models
       Description: Datasets and models for evaluation ("MNIST (LeNet-1)","MNIST (LeNet-4)","MNIST (LeNet-5)","Contagio (<200,200>)","Drebin (<200, 10>)","ImageNet (VGG-19)","ImageNet (ResNet-50)","Udacity (DAVE-2)","Cifar10 (ResNet56)","Cifar10 (DenseNet121)").

-----------------------------------------
DATA DESCRIPTION FOR: (N) themis_performanceBreakdown.png
-----------------------------------------


1. Number of variables: 3 


2. Number of cases/rows: 80


3. Missing data codes: n/a


4. Variable List

    A. Name: Execution Time
       Description: Execution Time of Themis components ("Total Test Time","FoP Calculator","FoP Estimator","FoP Fuzzer (iterations)").

    B. Name: FoP Sampler 
       Description: With or Without FoP Sampler.

    C. Name: Datasets and models
       Description: Datasets and models for evaluation ("MNIST (LeNet-1)","MNIST (LeNet-4)","MNIST (LeNet-5)","Contagio (<200,200>)","Drebin (<200, 10>)","ImageNet (VGG-19)","ImageNet (ResNet-50)","Udacity (DAVE-2)","Cifar10 (ResNet56)","Cifar10 (DenseNet121)").

-----------------------------------------
DATA DESCRIPTION FOR: (O) themis_correlationMean.py
-----------------------------------------


1. Number of variables: 2 


2. Number of cases/rows: 40


3. Missing data codes: n/a


4. Variable List

    A. Name: Adversarial attacks 
       Description: Adversarial attacks on datasets ("CW", "FGSM", "PGD", "GAUSSIAN").

    B. Name: Datasets and models
       Description: Datasets and models for evaluation ("MNIST (LeNet-1)","MNIST (LeNet-4)","MNIST (LeNet-5)","Contagio (<200,200>)","Drebin (<200, 10>)","ImageNet (VGG-19)","ImageNet (ResNet-50)","Udacity (DAVE-2)","Cifar10 (ResNet56)","Cifar10 (DenseNet121)").

-----------------------------------------
DATA DESCRIPTION FOR: (P) themis_retrainDiversity.py
-----------------------------------------


1. Number of variables: 2 


2. Number of cases/rows: 50


3. Missing data codes: n/a


4. Variable List

    A. Name: Systems
       Description: Deep Learning Testing Systems ("Themis", "DeepXplore","DeepGauge","DeepImportance","Surprise Adequacy for DL systems").

    B. Name: Datasets and models
       Description: Datasets and models for evaluation ("MNIST (LeNet-1)","MNIST (LeNet-4)","MNIST (LeNet-5)","Contagio (<200,200>)","Drebin (<200, 10>)","ImageNet (VGG-19)","ImageNet (ResNet-50)","Udacity (DAVE-2)","Cifar10 (ResNet56)","Cifar10 (DenseNet121)").

-----------------------------------------
DATA DESCRIPTION FOR: (Q) themis_sensitivitySample.py
-----------------------------------------


1. Number of variables: 2 


2. Number of cases/rows: 50


3. Missing data codes: n/a


4. Variable List

    A. Name: Sample size
       Description: Themis's sample size ("10^{1}","10^{2}","10^{3}","10^{4}","10^{5}").

    B. Name: Datasets and models
       Description: Datasets and models for evaluation ("MNIST (LeNet-1)","MNIST (LeNet-4)","MNIST (LeNet-5)","Contagio (<200,200>)","Drebin (<200, 10>)","ImageNet (VGG-19)","ImageNet (ResNet-50)","Udacity (DAVE-2)","Cifar10 (ResNet56)","Cifar10 (DenseNet121)").

-----------------------------------------
DATA DESCRIPTION FOR: (R) themis_sensitivityThreshold.py
-----------------------------------------


1. Number of variables: 2 


2. Number of cases/rows: 50


3. Missing data codes: n/a


4. Variable List

    A. Name: Threshold values
       Description: Themis's threshold value for MCMC ("0.00","0.02","0.04","0.06","0.08").

    B. Name: Datasets and models
       Description: Datasets and models for evaluation ("MNIST (LeNet-1)","MNIST (LeNet-4)","MNIST (LeNet-5)","Contagio (<200,200>)","Drebin (<200, 10>)","ImageNet (VGG-19)","ImageNet (ResNet-50)","Udacity (DAVE-2)","Cifar10 (ResNet56)","Cifar10 (DenseNet121)").

-----------------------------------------
DATA DESCRIPTION FOR: (S) themis_performanceE2e.py
-----------------------------------------

1. Number of variables: 2 


2. Number of cases/rows: 50


3. Missing data codes: n/a


4. Variable List

    A. Name: Systems
       Description: Deep Learning Testing Systems for evaluation ("Themis", "DeepXplore", "DeepGauge", "DeepImportance","Surprise Adequacy").

    B. Name: Datasets and models
       Description: Datasets and models for evaluation ("MNIST (LeNet-1)","MNIST (LeNet-4)","MNIST (LeNet-5)","Contagio (<200,200>)","Drebin (<200, 10>)","ImageNet (VGG-19)","ImageNet (ResNet-50)","Udacity (DAVE-2)","Cifar10 (ResNet56)","Cifar10 (DenseNet121)").

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

Software-specific information:

Name: Python
Version: 3.7
System Requirements: n/a
Open Source? (Y/N): Y