A replicated controlled study confirms that developers’ perceptions, preferences, and opinions about software testing techniques do not reliably predict actual effectiveness.

Why the Testing Method Developers Prefer Is Rarely Ever the One That Finds the Most Bugs

Abstract

1 Introduction

2 Original Study: Research Questions and Methodology

3 Original Study: Validity Threats

4 Original Study: Results

5 Replicated Study: Research Questions and Methodology

6 Replicated Study: Validity Threats

7 Replicated Study: Results

8 Discussion

9 Related Work

10 Conclusions And References


7 Replicated Study: Results

Of the 46 students participating in the experiment, seven did not complete the questionnaire and were removed from the analysis. Table 19 shows the changes in the experimental groups due to students not participating in the study. Balance is not seriously affected by mortality, although it would have been desirable for Group 5 to have at least one more participant.

Additionally, another four participants did not answer all the questions and were removed from the analysis of the respective questions.

7.1 RQ1: Participants’ Perceptions as Predictors

7.1.1 RQ1.1-RQ1.5: Comparison with Original Study Results

Appendix C shows the analysis of the experiment. Program is the only statistically significant variable (group, technique and the program by technique interaction are not significant). In this replication, fewer defects are found in cmdline than in nametbl and ntree, where the same number of defects is found. Some results are in line with those obtained in the original study:

– There is no interaction with selection effect. Group is not significant.

– Mortality does not affect experimental results. The analysis technique used (Linear Mixed-Effects Models) is robust to the lack of balance; a minimal sketch of such a model appears after this list.

– Results cannot be generalized to other subject types.
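
For readers who want to reproduce this kind of analysis, the sketch below fits a linear mixed-effects model of the sort described above with statsmodels: technique, program, their interaction, and design group as fixed effects, and a random intercept per participant. It is a minimal illustration only; the column names, the design-group coding, and the simulated effectiveness scores are hypothetical, and the paper's actual model specification may differ.

```python
# Minimal sketch (not the paper's actual analysis script) of a linear
# mixed-effects model for defect-detection effectiveness. All column names
# and values are hypothetical placeholders.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
techniques = ["CR", "BT", "EP"]
programs = ["cmdline", "nametbl", "ntree"]

rows = []
for i in range(30):                          # 30 hypothetical participants
    group = f"G{i % 6 + 1}"                  # 6 hypothetical design groups
    subj_effect = rng.normal(0, 8)           # participant ability offset
    order = np.roll(programs, i % 3)         # rotate program assignment
    for tech, prog in zip(techniques, order):
        eff = 55 + subj_effect + rng.normal(0, 10)   # effectiveness (%)
        rows.append((f"s{i}", group, tech, prog, eff))

data = pd.DataFrame(rows, columns=["subject", "group", "technique",
                                   "program", "effectiveness"])

# Random intercept per subject absorbs within-participant correlation and
# tolerates unbalanced groups.
model = smf.mixedlm("effectiveness ~ C(group) + C(technique) * C(program)",
                    data, groups=data["subject"])
print(model.fit().summary())
```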

But others contradict those obtained in the original study, and therefore need further investigation:

– A maturation effect cannot be ruled out. The program with the lowest effectiveness is the one used on the first day.

– Order of training does not seem to be affecting results. All techniques show the same effectiveness.

Table 20 shows the results of participants’ perceptions for techniques. The results are the same as in the original study (χ²(2, N=37) = 3.622, p = 0.164). Our data do not support the conclusion that the techniques differ in how frequently they are perceived as the most effective.
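
As a side note on how such a uniformity check can be run, the snippet below applies a chi-square goodness-of-fit test against a uniform distribution with scipy. The counts are illustrative placeholders, not the study's data.

```python
# Chi-square goodness-of-fit test: are the three techniques named as "most
# effective" with equal frequency? Counts below are hypothetical.
from scipy.stats import chisquare

perceived_counts = [10, 13, 14]   # hypothetical counts for CR, BT, EP (N = 37)

stat, p = chisquare(perceived_counts)   # expected frequencies default to uniform
print(f"chi2({len(perceived_counts) - 1}, N={sum(perceived_counts)}) "
      f"= {stat:.3f}, p = {p:.3f}")
```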

Our data do not support the conclusion that participants correctly perceive the most effective technique for them. The overall and per technique kappa values and 95% CI reported in Table 21 are in line with those in the original study. This suggests that the hypothesis we elaborated in the original experiment would not be correct. For some reason, perceptions are more accurate with the CR technique.
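
The agreement statistic used here, Cohen's kappa with a 95% confidence interval, can be approximated as in the sketch below. The perception and outcome vectors are randomly generated stand-ins; in the study they would come from the questionnaire answers and the measured effectiveness per participant.

```python
# Cohen's kappa between the technique each participant perceived as most
# effective and the technique that actually was, plus a percentile-bootstrap
# 95% CI. The label vectors are placeholders, not the study's data.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(1)
labels = np.array(["CR", "BT", "EP"])

perceived = rng.choice(labels, size=37)   # questionnaire answers (placeholder)
actual = rng.choice(labels, size=37)      # observed best technique (placeholder)

kappa = cohen_kappa_score(perceived, actual)

boot = []
for _ in range(2000):                     # resample participants with replacement
    idx = rng.integers(0, len(perceived), len(perceived))
    boot.append(cohen_kappa_score(perceived[idx], actual[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"kappa = {kappa:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```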

Again, as in the original study, we have not been able to observe bias in perceptions (the Stuart-Maxwell test outputs χ²(2, N=37) = 3.103, p = 0.212, and the McNemar-Bowker test outputs χ²(3, N=37) = 3.143, p = 0.370). Table 22 shows the value of Krippendorff’s α and its 95% CI, overall and for each pair of techniques, for all participants and for every design group (participants who applied the same technique on the same program) separately, and Table 23 shows the value of Krippendorff’s α and its 95% CI, overall and for each program/session. Again, the results obtained are the same as in the original study. Participants do not obtain effectiveness values so similar across techniques (or across programs) that it would be difficult to discriminate among techniques/programs.
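
The bias and reliability checks reported above can be reproduced along the following lines with statsmodels (Stuart-Maxwell and McNemar-Bowker tests on a square contingency table) and the third-party krippendorff package. The contingency table and the effectiveness scores are invented placeholders, not the study's data.

```python
# Marginal-homogeneity (Stuart-Maxwell) and symmetry (McNemar-Bowker) tests
# on a square perceived-vs-actual technique table, plus Krippendorff's alpha
# on effectiveness scores. All numbers are hypothetical.
import numpy as np
from statsmodels.stats.contingency_tables import SquareTable

# Rows: technique perceived as most effective; columns: technique that
# actually was (order CR, BT, EP).
table = np.array([[5, 3, 2],
                  [4, 6, 3],
                  [2, 5, 7]])

sq = SquareTable(table, shift_zeros=False)
print(sq.homogeneity(method="stuart_maxwell"))   # Stuart-Maxwell test
print(sq.symmetry(method="bowker"))              # McNemar-Bowker test

# Krippendorff's alpha: one row per technique, one column per participant,
# values are effectiveness scores (np.nan marks a missing measurement).
import krippendorff
scores = np.array([[60.0, 45.0, 80.0, np.nan],   # CR
                   [55.0, 65.0, 40.0, 70.0],     # BT
                   [70.0, 50.0, 75.0, 65.0]])    # EP
print(krippendorff.alpha(reliability_data=scores,
                         level_of_measurement="interval"))
```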

Table 24 and Figure 2 show the cost of mismatch. As in the original study, the mismatch cost is not related to the technique perceived as being the most effective (Kruskal-Wallis, H(2) = 2.979, p = 0.226). Also, the proportion of mismatches is about the same as in the original study (48% of mismatches in the original study versus 51% in the replicated study).
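
The Kruskal-Wallis comparison of mismatch cost across perceived-best techniques can be run as in the sketch below; the cost values are hypothetical placeholders.

```python
# Kruskal-Wallis test of whether mismatch cost differs across the technique
# perceived as most effective. Cost values (percentage points of lost
# effectiveness) are hypothetical.
from scipy.stats import kruskal

cost_cr = [5, 0, 10, 2, 8]        # participants who perceived CR as best
cost_bt = [20, 35, 15, 30]        # participants who perceived BT as best
cost_ep = [25, 40, 10, 30, 20]    # participants who perceived EP as best

h, p = kruskal(cost_cr, cost_bt, cost_ep)
print(f"H(2) = {h:.3f}, p = {p:.3f}")
```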

However, there are some differences with respect to the original study:

– While CR had the greatest number of mismatches in the original study, now it has the smallest. The number of mismatches for BT and EP has increased with respect to the original study.

– In the replicated study, the mismatch cost is slightly lower (25pp compared with 31pp in the original study). The mismatch cost is smaller when CR is involved.

This could be due to the change in the seeded faults or just to natural variation; it should be checked further. However, it is a fact that the effectiveness of EP and BT has decreased in the replicated study, while CR has a similar effectiveness to the one in the original study. This suggests that the mismatch cost could be related to the faults that the program contains. However, this issue needs to be investigated further, as we have few data points. Note that, as in the original study, the small number of data points could be affecting these results.

Table 25 shows the average loss of effectiveness that should be expected in a project due to mismatch. The expected loss of effectiveness in a project is similar to the one observed in the original study (13pp), but this time it is related to the technique perceived as most effective (Kruskal-Wallis, H(2) = 9.691, p = 0.008). This means that some mismatches are more costly than others. The misperception of CR as being the most effective technique has a lower associated cost (4pp) than that of BT or EP (18pp).

This suggests that participants who think CR is the most effective technique might be allowed to apply it, as, even if they are wrong, the loss of effectiveness would be negligible. However, participants should not rely on their perceptions even in this case, since fault type could have an impact on this result and they will never know beforehand what faults the program contains. Note that, again, the small number of data points could be affecting these results. Therefore, this issue needs to be researched further.

The findings of the replicated study are:

– They confirm the results of the original study.

– A possible relationship between fault type and mismatch cost should be further investigated.

Since the results of both studies are similar, we have pooled the data and performed joint analyses for all research questions to overcome the problem of lack of power due to sample size. They are reported in Appendix D. The results confirm those obtained by each study individually. This allows us to gain confidence in the results obtained.

7.1.2 RQ1.6: Perceptions and Number of Defects Reported

One of the conclusions of the original study was that the perceived most effective technique could match the technique with the highest number of defects reported. Table 26 shows the value of kappa and its 95% CI, overall and for each technique separately. We find that all kappa values for the agreement between the perceived most effective technique and the technique with the greater number of defects reported are consistent with lack of agreement (κ < 0.4, poor). However, the upper bounds of all 95% CIs show agreement, and the lower bounds of all 95% CIs but BT’s are greater than zero. This means that, although our data do not support the conclusion that participants correctly perceive the most effective technique for them, it should not be ruled out. In other words, participants’ perceptions about technique effectiveness could be related to reporting a greater number of defects with that technique.

As lack of agreement cannot be ruled out, we examine whether the perceptions are biased. The results of the Stuart-Maxwell test show that the null hypothesis of marginal homogeneity cannot be rejected (χ²(2, N=37) = 2.458, p = 0.293). This means that we cannot conclude that perceptions and reported defects are differently distributed. Additionally, the results of the McNemar-Bowker test show that the null hypothesis of symmetry cannot be rejected (χ²(3, N=37) = 2.867, p = 0.413). This means that we cannot conclude that there is directionality when participants’ perceptions do not match the technique with the highest number of defects reported. The lack of a clear agreement could be due to the fact that participants do not remember exactly the number of defects found with each technique.

7.1.3 RQ1.1-RQ1.2: Program Perceptions

Table 27 shows the results of participants’ perceptions of the program in which they detected most defects. We found that the same phenomenon applies to programs as to techniques. The three programs cannot be considered to be perceived with different frequencies as the ones where most defects were found, as we cannot reject the null hypothesis that the frequency distribution of the responses follows a uniform distribution (χ²(2, N=37) = 2.649, p = 0.266). Our data do not support the conclusion that the programs differ in how frequently they are perceived as having a higher percentage of defects found than the others. This contrasts with the fact that cmdline has a slightly higher complexity and number of LOC, and that ntree shows the highest Halstead metrics. We expected cmdline and/or ntree to be perceived less frequently as having a higher detection rate.

However, the values for kappa in Table 28 show that there seems to be agreement overall and for cmdline and ntree (κ > 0.4, fair to good, and agreement by chance can be ruled out, since 0 does not belong to the 95% CI), but not so for the nametbl program (κ = 0.292, poor, and agreement by chance cannot be ruled out, as 0 belongs to the 95% CI). This means that participants do tend to correctly perceive the program in which they detected most defects. This is striking, as it contrasts with the disagreement observed for techniques. Pending the analysis of the mismatch cost, it suggests that participants’ perceptions of the percentage of defects found may be reliable. This is interesting, as cmdline has a higher complexity. Since there is agreement, we are not going to study the mismatch cost. Misperceptions do not seem to affect participants’ perception of how well they have tested a program.

7.2 RQ2: Participants’ Opinions as Predictors

7.2.1 RQ2.1: Participants’ Opinions

Table 29 shows the results for participants’ opinions with respect to techniques.

With regard to the technique participants applied best (OT1), we can reject the null hypothesis that the frequency with which they perceive that they had applied each of the three techniques best is the same (χ²(2, N=38) = 10.947, p = 0.004). More people think they applied EP best, followed by both BT and CR (which merit the same opinion).

In the case of the technique participants liked best (OT2), the results are similar. We can reject the null hypothesis that participants equally often regard all three techniques as being their favourite technique (χ²(2, N=38) = 22.474, p < 0.001). Most people like EP best, followed by both BT and CR (which merit the same opinion).

Finally, as regards the technique that participants found easiest to apply (OT3), the results are exactly the same as for the preferred technique (χ²(2, N=38) = 22.474, p < 0.001). Most people regard EP as the easiest technique to apply, followed by both BT and CR (which merit the same opinion).

Table 30 shows the results for participants’ opinions about the programs. We cannot reject the null hypothesis that all programs are equally frequently viewed as the simplest (χ²(2, N=38) = 1.474, p = 0.479). Therefore, our data do not support the conclusion that the three programs are perceived as the simplest with different frequencies. This result suggests that both the differences in complexity and size of cmdline and the higher Halstead metrics of ntree are small. It also suggests that participants could be interpreting this question differently. Another possibility is that the question used to operationalize the corresponding construct is vague, and participants are not interpreting it correctly.

7.2.2 RQ2.2: Comparing Opinions with Reality

The technique that participants think they applied best (OT1) is not a good predictor of technique effectiveness. The overall and per-technique kappa values in the fourth column of Table 31 are consistent with lack of agreement (κ < 0.4, poor, in all cases; and although the upper bounds of the 95% CIs show agreement, 0 belongs to most 95% CIs, meaning that agreement by chance cannot be ruled out). However, we find that there is a bias, as the Stuart-Maxwell and McNemar-Bowker tests can reject the null hypotheses of marginal homogeneity (χ²(2, N=38) = 10.815, p = 0.004) and symmetry (χ²(3, N=38) = 12.067, p = 0.007), respectively.

Looking at the light and dark grey cells in the corresponding contingency table, shown in Table 32, we find that the cells below the diagonal have higher values than those above the diagonal. In other words, there are rather more participants who consider that they applied EP best despite achieving better effectiveness with CR and BT (9 and 5, respectively) than participants who consider that they applied CR or BT best despite being more effective with EP (1 in both cases). This suggests that there is a bias towards EP. This bias is much more pronounced with respect to CR. These results are consistent with the ones found in the previous section.

There are several possible interpretations of these results: 1) we do not know whether the opinion on the best applied technique is accurate (that is, whether it really is the best applied technique); 2) possibly due to the change in faults, technique performance is worse in this replication than in the original study; and 3) participants may have misunderstood the question. Interviewing participants, or asking them in the questionnaire about the reasons for their answers, would have helped to clarify this last issue.

As regards participants’ favourite technique (OT2), the results are similar. This opinion does not predict technique effectiveness, since all kappa values in the fourth column of Table 31 denote lack of agreement (κ < 0.4, poor, in all cases; and although the upper bounds of the 95% CIs show agreement, 0 belongs to all 95% CIs, meaning that agreement by chance cannot be ruled out). Again, we find there is bias, as the Stuart-Maxwell and McNemar-Bowker tests can reject the null hypotheses of marginal homogeneity (χ²(2, N=38) = 11.931, p = 0.003) and symmetry (χ²(3, N=38) = 11.974, p = 0.007), respectively. Looking at the light and dark grey cells in Table 33, we again find that there is bias towards EP. There are rather more participants whose favourite technique is EP despite being more effective with CR and BT (12 and 5, respectively) than participants whose favourite is CR or BT despite being more effective with EP (1 in both cases). Note that the bias between CR and EP is more pronounced. It is very unlikely that participants have misinterpreted this question; it just seems that the technique they like most is not typically the most effective.

Finally, with respect to the technique that participants found easiest to apply (OT3), the results are exactly the same as for their preferred technique. However, as we saw for OT2, their preferred technique is not a good predictor of effectiveness (see the third row of Table 31), and there is a bias towards EP (see the light and dark grey cells in Table 33). These results are in line with a common claim in SE, namely that developers should not base their decisions on their opinions, as these are biased. Again, it should be noted that participants might not be interpreting the question as we expected. Further research is necessary.

As far as the simplest program is concerned, we find, as we did for the techniques, that it is not a good predictor of the program in which most defects were detected (the overall and per-program kappa values in Table 34 denote lack of agreement: κ < 0.4, poor, in all cases; and although the upper bounds of the 95% CIs show agreement, 0 belongs to the 95% CIs, except for ntree, meaning that agreement by chance cannot be ruled out). Unlike the opinions on techniques, we were not able to find any bias this time, as neither the null hypothesis of marginal homogeneity (χ²(2, N=38) = 1.621, p = 0.445) nor that of symmetry (χ²(3, N=38) = 3.286, p = 0.350) can be rejected. This result suggests that the programs that participants perceive to be the simplest are not necessarily the ones where most defects have been found. Again, it should be noted that participants may be interpreting "the simplest program" differently than we expected.

7.2.3 Findings

Our findings suggest:

– Participants’ opinions should not drive their decisions.

– Participants prefer EP (think they applied it best, like it best and think that it is easier to apply), and rate CR and BT equally.

– All three programs are equally frequently perceived as being the simplest.

– The programs that the participants perceive as the simplest are not the ones where the highest number of defects have been found.

These results should be understood within the validity limits of the study.

7.3 RQ3: Comparing Perceptions and Opinions

7.3.1 RQ3.1: Comparing Perceptions and Opinions

In this section, we look at whether participants’ perceptions of technique effectiveness are biased by their opinions about the techniques. According to the kappa values shown in the fourth column of Table 35 (PT1-OT1), the results are compatible with agreement between the technique perceived to be the most effective and the technique participants think they applied best, overall and per technique, except for BT, for which lack of agreement cannot be ruled out (κ > 0.4, fair to good, in all cases; and in all cases but BT, 0 does not belong to the 95% CI, meaning that agreement by chance can be ruled out).

This is an interesting finding, as it suggests that participants think that technique effectiveness is related to how well the technique is applied. Technique performance certainly decreases if techniques are not applied properly. It is no less true, however, that techniques have intrinsic characteristics that may lead to some defects not being detected. In fact, the controlled experiment includes some faults that some techniques are unable to detect. A possible explanation for this result could be that the evaluation apprehension threat is materializing.

On the other hand, the kappa values in the fourth column of Table 35 (PT1-OT2) reveal a lack of agreement for CR and BT between the preferred technique and the technique perceived as most effective (κ < 0.4, poor, in both cases; and although the upper bounds of the 95% CIs show agreement, 0 belongs to the 95% CIs, meaning that agreement by chance cannot be ruled out), whereas overall lack of agreement cannot be ruled out (κ < 0.4, poor; the upper bound of the 95% CI shows agreement, and 0 does not belong to the 95% CI, meaning that agreement by chance can be ruled out). Finally, there is agreement (κ > 0.4, fair to good) in the case of EP.

This means that, in the case of EP, participants tend to associate their favourite technique with the technique perceived as most effective, contrary to the findings for CR and BT. This is more likely to be due to EP being the technique that many more participants like best (so the chances of there being a match are higher than for the other techniques) than to there actually being a real match. With respect to directionality whenever there is disagreement, the results of the Stuart-Maxwell and McNemar-Bowker tests show that the null hypotheses of marginal homogeneity (χ²(2, N=37) = 8.355, p = 0.015) and symmetry (χ²(3, N=37) = 8.444, p = 0.038) can be rejected.

Looking at the light grey cells in Table 36, we find that there are more participants claiming to have applied CR best who prefer EP than vice versa (8 versus 1). This means that the mismatch between the technique that participants like best and the technique that they perceive as being most effective can largely be put down to participants who like EP better perceiving CR to be more effective.

The results for the agreement between the technique that is easiest to apply and the technique perceived to be most effective are exactly the same as for the preferred technique (see the third row of Table 35). This means that, for EP, participants equate the technique that they find easiest to apply with the one that they regard as most effective. This does not hold for the other two techniques. Likewise, the mismatch between the technique that is easiest to apply and the technique perceived as most effective can largely be put down to participants who applied EP best perceiving CR to be more effective (see Table 36).

As mentioned earlier, we found that participants have a correct perception of the program in which they detected most defects. Table 37 shows that participants do not associate the simplest program with the program in which most defects were detected (PP1-OP1). This is striking, as it would be logical for it to be easier to find defects in the simplest program. As illustrated by the fact that neither the null hypothesis of marginal homogeneity (χ²(2, N=37) = 3.220, p = 0.200) nor that of symmetry (χ²(3, N=37) = 4.000, p = 0.261) can be rejected, we were not able to find bias in any of the cases where there is disagreement. A possible explanation for this result is that participants are not properly interpreting what "simple" means.

7.3.2 RQ3.2: Comparing Opinions

Finally, we study the possible relationship between the opinions themselves. Looking at Table 38, we find that participants equate the technique they applied best with their favourite technique and with the technique they found easiest to apply (κ > 0.4, fair to good, overall and per technique, and 0 does not belong to the 95% CIs, meaning that agreement by chance can be ruled out). It makes sense that the technique that participants found easiest to apply should be the one that they think they applied best and like best. Typically, people like easy things (or maybe we think things are easy because we like them). In this respect, we can conclude that participants’ opinions about the techniques all point in the same direction.

7.3.3 Findings

Our findings suggest:

– Participants’ perceptions of technique effectiveness are related to how well they think they applied the techniques. They tend to think it is they, rather than the techniques, that are the obstacle to achieving more effectiveness (a possible evaluation apprehension threat has materialized).

– We have not been able to find a relationship between the technique they like best and find easiest to apply, and perceived effectiveness. Note, however, that the technique participants think they applied best is not necessarily the one that they really applied best.

– Participants do not associate the simplest program with the program in which they detected most defects. This could be due to participants not properly interpreting the concept "simple".

– Opinions are consistent with each other.

Again, these results are confined to the validity limits imposed by the study.

:::info Authors:

  1. Sira Vegas
  2. Patricia Riofrío
  3. Esperanza Marcos
  4. Natalia Juristo

:::

:::info This paper is available on arxiv under CC BY-NC-ND 4.0 license.

:::

