TL;DR: The following is an experience report from Cassandra H. Leung and Mike Talks on using session-based test management to test Mike’s Facebook Messenger chatbot. With SBTM and chatbots both being new territory for Cassandra, she shares her first impressions and thoughts on the two. From seeing subsequent chat logs, Mike also shares his thoughts on the testing approach, techniques used, and what testing revealed about the chatbot.
Mike Talks is a fellow writer in software testing, given to occasional speaking. He’s been working in IT for over 20 years, but has shown a temperament for trying out new things whenever he can. Before this, he worked on using AI methods to process sensor signals for a secret government project. He’s been fascinated with machine learning ever since; it’s just taken 20 years for the world to catch up…
He currently writes about everything and anything on Medium.
When Mike sent me the link to his chatbot, it gave me the opportunity to do two things – firstly to test a chatbot (and, indeed, any AI) for the very first time, and secondly to put my readings about SBTM session reports into practice.
I also wanted to try a different approach to test planning, and decided to self-impose a degree of blindness.
Spending some time with Patrick Prill and his colleagues at Quality Minds earlier this year made me realise that the vast majority of test planning I do is in my head (unwritten as opposed to fictional). While many other testers like to use mind maps, and some are in the habit of writing test scripts before any practical testing begins, I’ve very much been using a “jump straight in” approach.
This doesn’t mean I haven’t been doing any planning at all, rather that I know the features I want to visit, but I don’t have all the routes planned out. I see what happens after I get started, with a general idea of how I want things to go, but I have the flexibility to let the software’s responses lead me.
When it came to testing Mike’s chatbot, I wanted to try planning things out on paper before I even looked at the system under test, just to see how things might differ.
This is where the self-imposed blindness came in: I decided I’d plan my testing before looking at the object under test. I wanted to challenge myself without having any “clues” about the chatbot, so I didn’t look at it beforehand or ask Mike any questions about how it worked, or what it was supposed to do.
This planning is documented in my Ideas Session report, which you can read here.
In hindsight, the self-imposed blindness was an odd choice, but it made sense to me at the time. I wanted to freely generate test ideas with as few hints and as little bias as possible.
However, as you’ll see from the Shallow Test Session report further down, generating lots of test ideas before laying eyes on the product actually led me to generate a number of ideas that I very quickly discovered were not relevant, or could be dismissed en masse due to the chatbot’s responses. They showed a pattern of behaviour that would allow me to group many of the test ideas together and reasonably assume the outcome for all, by testing just a few.
I was already biased; I just didn’t know it. When Mike originally sent me the link to the chatbot several weeks ago, he said very little about it. But he did make a comment that there were some amusing responses to inputs about Donald Trump. Subconsciously, I took this comment and used it to form an assumption that this was a “general” chatbot – not so much designed for a specific purpose, but something I could “talk” to; that might even imitate a real person.
Once I began “hands-on” testing, I immediately realised that this bot was not what I thought it was. It was specifically designed to respond to testing queries, and declared itself as a bot from the outset. All my ideas about “fooling” the user, or it not being aware of its own bot nature were redundant. This was not going to be a Turing test.
I had very few ideas specifically geared towards testing-related inputs, due to the assumptions I’d made. I could have clicked Send Message and had a quick peek at the chatbot before starting the Ideas Session, but I also assumed that, since the chatbot existed within Facebook Messenger, the UI would be very limited. How much helpful information could I get before actually entering text and, in effect, starting to test? As it turns out, I could get at least enough to point my test ideas in the right direction.
Again, what I saw as adding to the challenge was probably just a strange choice on my part, as I’d collect information before planning out a test approach on any other system, so this shouldn’t have been any different. However, it made me think about testers who are required to write test suites before a product is ready for test, or sometimes even before it’s been built. How can they do that competently and efficiently if all they have are concepts and assumed knowledge? “Oh, you thought it would work like x? Nah, we went for y. I thought John was going to tell you…”
There were several tests I performed that I hadn’t thought of during the Ideas Session, because they were largely conceived from what the system presented to me. For example, when the chatbot gave outputs about context driven testing and named some high-profile testers in the automation space, I decided to try inputting “James Marcus Bach”. Interestingly, including James’ middle name confused the chatbot, but inputting simply “James Bach” did not. There would have been no realistic need to plan for these tests, had I not seen the other outputs.
I was reacting to what I actually saw happening, rather than sticking strictly to the (somewhat misguided) ideas I’d generated previously. For me, this act and react relationship between software and testers is what makes exploratory testing so interesting and valuable.
Using Session Reports – “The Paddling Pool Tour”
Here’s the session report for what I called the “Shallow Test Session”.
For this session, I wanted to use the ideas generated from the Ideas Session to get a high level understanding of the chatbot, as noted in the session charter. To this end, I decided to invent my own testing tour – the Paddling Pool Tour (I’ve not seen this anywhere else but let me know if you have).
As it sounds, the Paddling Pool Tour is a shallow dabble to test the waters, without getting too deep or detailed. This would also allow me to cover a number of ideas in a short space of time, without having to guess in advance which one(s) might merit more testing time. I could use the information I gathered to influence future test sessions.
For me, using session reports highlighted just how much thinking, theorising, and exploration I do when testing. I always thought it was a lot, but the time it took to gather those thoughts and decide what to filter out from my report really brought the volume to my attention.
I filtered out a lot. Maybe too much. How can that be if I wrote so much? Even as I was testing, typing, testing, typing, I noticed that I was only really recording points of interest, or behaviours I didn’t expect. What about the tests I performed that did turn out as expected, or were less interesting to me at the time? It was like they never happened…
Aren’t those just as important to report, to properly record the learnings from the session and avoid unnecessary duplication? Am I doing it “wrong”? Are these the kind of test notes one would expect to see in an SBTM session report, or is this just a sign of my inexperience using them?
As well as filtering out several thoughts and findings, I didn’t log anything under the Bugs or Issues sections. This is because I worked with the idea that the “black box” and lack of answers were part of the challenge; an all too realistic challenge for many testers. Lots of things that could be considered as bugs, I thought might also be considered as accepted limitations of the product, and therefore be rejected as bugs on a “real” project.
To determine the correct classification, I would need to speak to someone else – the developer, product owner: Mike. But I’d self-imposed the condition that I couldn’t talk to Mike about it. Perhaps that was another strange choice.
As testers, we can only point out areas in the system that feel wrong, with an explanation of why, but the final say on whether something is a bug or not usually goes to the owner of the system.
But this was more of an exercise and experiment for me to learn from. I didn’t want to bother Mike with questions about his side project. Deciding to write about our experiences came afterwards.
Something that bothered me on concluding the test session, and also when reading the report back, was the lack of summary findings. Nothing to represent my thoughts on the chatbot post-test, or any comments for future testing. No insight.
I did have thoughts on those – amongst other things – they just hadn’t been recorded. Does that all have to be done in the debrief – unrecorded – or would that go in the Test Notes section too? I tend to lean towards a tailored approach, so, “whatever works for you”.
What works for you, on your project?
Experiments like this are bound to be different from real situations (think test environment vs. production), but there are certainly things I could have done to better imitate a more realistic situation. Having said that, I was trying to challenge myself and do something different to my norm, which I did.
This exercise has confirmed that what works for me, in terms of not writing down a plan before testing, is absolutely fine. It might be different from lots of other testers, and it might not be the most suitable method in every situation; I might not even choose to use it on every occasion, but it works well for me in most of my situations.
To me, the biggest beauty of exploratory testing is being able to react to the system under test; to be able to adapt and form new test ideas with every new thing you learn about the product. Some of the tests I performed on the chatbot were purely in response to its outputs and not previously planned, but they were appropriate and valuable. They helped me learn.
Putting the Ideas Session to one side, I can certainly see how, with a little practice and refinement, session reports can be useful for documenting and auditing. However, the biggest question mark for me remains over how much information to include.
Why do I have so many thoughts? What should I do with them? It was really difficult for me to decide what to include, and I don’t think I got it right. My guess, however, is that the key is in the audience. Who are the reports generated for? Who is going to read them? Like with any form of communication, this should help to guide judgements on what would be useful to the intended audience, and what can be left bouncing around in my own mind.
Note: Between writing and publishing this post, Marcel Gehlen has published a fantastic resource on exploratory testing, complete with exercises. I highly recommend you use it!
Since I launched my chatbot earlier in the year, I’ve been fascinated by how people interact with it – what would they ask? Would they find the hidden Easter Egg items I’d put in? (Yeah, I might have hinted to Cassandra that there was something on Trump.)
I’ve seen several testers use it, and they’ve followed similar patterns – really what I’d like to call “the curiosity pattern”. They initiate conversation, try out a couple of testing-based questions, throw in a few wild cards on food, but after 5 minutes, they’ve satisfied their curiosity, and are ready to end the conversation.
And that’s fine – because it’s designed as a demonstrator and play tool, and no-one is being paid to test.
But going through Cassandra’s transcript, I noticed this was the first time someone had applied a more systematic approach to really test the system.
One of the things I most enjoyed about reading Cassandra’s interaction is the satisfaction you get watching a craftsperson at work. As someone who has taught testing through mentoring, pairing and workshops, there is a delight in watching people at work exploring and discovering. Sometimes we’re so close to testing, even when we pair, we’re rarely able to take a step back and enjoy the activity for the strange breed of science and intuition it involves.
I’m going to pick out some methods that I saw Cassandra apply within her session that illustrate her approach, and which I loved seeing being executed.
Creep and Leap
Most of all, Cassandra used a form of creep and leap technique. That basically means “creeping” – trying a variety of similar concepts to try and find the boundary / trigger around an area. This is a great approach in AI, where there are no real hard boundaries, but you can still explore to see what feels right.
When Cassandra felt she’d exhausted a topic area, she’d leap to try something radically different.
A great example of creeping – she found the chatbot responds to you saying, “hello,” so she tried different slang for “hello”, and different languages. I’d programmed a lot into the machine.
Likewise, when she discovered there was a Context Driven Tester clause, she tried asking about “context driven testing”, “james bach”, “michael bolton”, which the chatbot responded well to.
However, “james marcus bach” confused the chatbot, which is something I should consider adding a new rule for. This was a bit of a surprise for me – I’d have thought “james marcus bach” would have been a partial match for “james bach”, as I’ve seen the chatbot make partial matches before.
Breaking the Format Test Heuristic
Cassandra attempted to break the expected format – which is textual language – to see how the chatbot responded.
This included use of only emojis, sending pictures, and nonsensical random characters. (Again used in a creep format of exploration around a similar theme.)
She even tried to use a SELECT SQL command to try and extract a list of users from my chatbot. BAD CASSANDRA!!! (Actually, this also would be an example of using a form of security heuristic, and Cassandra tried a few variations, including XML codes.)
In each case the chatbot did not crash, and did as it was supposed to – respond with its default, “I don’t understand,” message.
She could also have tried sending an audio file – I did in my testing, and unsurprisingly it triggers a default message because the chatbot is confused.
James Bach calls galumphing “doing something in an overly elaborate way”. Every so often, Cassandra would try a longer than usual sentence. Actually, typically several sentences – these would often result in the chatbot being confused.
When these scenarios looked interesting, I’d attempt to repeat them in my own chat session.
Often I’d be baffled at some of the responses where the chatbot got confused. For instance, if you swear, no matter what else you say, it should pick up on the f-word and respond with a taunt about bad language.
What I discovered was that with some of Cassandra’s statements, if you broke them down and fed them into the chatbot a piece at a time, it’d respond sensibly.
The example… “So what if I do? What the fuck do you care about it?” left the chatbot confused.
“So what if I do?” has a response where the chatbot talks about what it calls itself.
“What the fuck,” has a sarcastic, “Why are you swearing?” response.
“Do you care about it,” has the chatbot respond with a quote about Gary Numan: “I’m your friend electric, in cars”.
Put all those three comments into one sentence, and the chatbot is push-me, pull-me’d in three directions, and ends up confused. This was a behaviour I only realised when I put some of her sentences under the microscope, repeating parts of them with my chatbot later on.
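The push-me, pull-me effect can be reproduced with a toy rule matcher. The sketch below is not the real chatbot: the trigger phrases, the placeholder replies, and the choice to fall back to the default message whenever more than one rule matches are all assumptions made for illustration. How the actual chatbot resolves competing matches isn't something we've described here.

```python
import re

# Hypothetical rules (trigger phrase -> canned reply). Invented for this
# illustration; they are not the real chatbot's rules.
RULES = {
    "so what if i do": "(reply about what the bot calls itself)",
    "what the fuck": "Why are you swearing?",
    "do you care about it": "(Gary Numan quote)",
}
DEFAULT_REPLY = "I don't understand"


def normalise(message):
    """Lower-case the message and replace punctuation with single spaces."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", message.lower()).split())


def matching_triggers(message):
    """Return every trigger phrase that appears as whole words in the message."""
    padded = f" {normalise(message)} "
    return [trigger for trigger in RULES if f" {trigger} " in padded]


def respond(message):
    """Reply only when exactly one rule matches (an assumed policy)."""
    matches = matching_triggers(message)
    if len(matches) == 1:
        return RULES[matches[0]]
    # Zero matches, or several competing matches: give up and use the default.
    return DEFAULT_REPLY


if __name__ == "__main__":
    for text in [
        "So what if I do?",
        "What the fuck,",
        "Do you care about it,",
        "So what if I do? What the fuck do you care about it?",
    ]:
        print(repr(text), "->", respond(text))
```

Fed one piece at a time, each fragment triggers a single rule. The combined sentence triggers all three, so this toy version gives up, which mirrors what we saw in the transcript.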
In similar ways, I sometimes saw things that I thought should be matched up but weren’t. An example was when Cassandra asked about “Mike Talks”, and it came through with a response about me, its creator.
When she asked about just “Mike”, the chatbot got confused. There are a lot of responses around Mike, and it really needs context like “Mike’s name” or “contact Mike” or “Mike looks like”.
What, to me, has been interesting is reading both Cassandra’s article and the chatbot transcript. Overall, there is a sense of structure to the testing and approach she’s taken. Her testing actions are very deliberate. But that reinforced something Cassandra said earlier, that she does plan out what she wants to do; it’s just that the planning occurs more in her head than being pre-documented. This is fine for small, story type activities like this. However, for a larger testing activity which could span days (this did not), having some map or notes to work to might be helpful.
One suggestion I would have regarding potential oracles – once she realised the chatbot worked on testing material, Cassandra might have wanted to grab a textbook, and look through the index at the back, trying random terms to see how the bot responded. Likewise, knowing it was designed to service my old blog, she might have wanted to look through blogs, and try items where she knew there should be matching terms – “automation”, “Java” or “mental health”, for instance.
Another potential avenue for testing would have been to ask a question of the chatbot to see its reaction. And then ask something very similar (using the creep of creep and leap) but using a combination of bad spelling, bad grammar, or “text speak”.
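If you wanted to prepare those near-identical inputs in advance, a small helper could generate them. This is only a sketch: the `variants` function, the `TEXT_SPEAK` substitutions, and the letter-swapping rule are made up for illustration, and the results still need pasting into the chat by hand.

```python
# Hypothetical "text speak" substitutions, chosen only for this example.
TEXT_SPEAK = {"you": "u", "are": "r", "what": "wot", "please": "plz", "about": "abt"}


def swap_letters(word):
    """Simulate a typo by swapping the 2nd and 3rd characters of longer words."""
    if len(word) <= 3:
        return word
    return word[0] + word[2] + word[1] + word[3:]


def variants(question):
    """Return degraded versions of a question: text speak, typos, sloppy typing."""
    words = question.split()
    text_speak = " ".join(TEXT_SPEAK.get(w.lower(), w) for w in words)
    misspelled = " ".join(swap_letters(w) for w in words)
    # Sloppy typing: everything lower-case, punctuation dropped.
    sloppy = "".join(ch for ch in question.lower() if ch.isalnum() or ch.isspace())
    return [text_speak, misspelled, sloppy]


if __name__ == "__main__":
    for variant in variants("What is context driven testing?"):
        print(variant)
```

Each variant stays close to the original question, so any change in the chatbot's response points at how tolerant its matching is rather than at a change of topic.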
Our Closing Thoughts
There were lots of things we learned from Cassandra’s chatbot testing session and our subsequent analysis.
Planning Happens, Even if Informally
The initial motivation for these testing sessions was Cassandra’s desire to try different approaches. Yes, it can seem that “all the cool kids” use mind maps, but Cassandra realised that a plan in her head is still a plan.
Testing is Experimentation
If you want to be a better tester, find time to experiment and get out of your comfort zone as a tester. A lot of testing workshops and activities revolve around this, giving you a small scale testing activity and letting you try a different way of working.
In this case, Cassandra used the chatbot to try a different testing approach. In trying to do something different, she learned more about her typical style and areas to improve on in future.
With the same goals, Mike got a team from his company involved in the Software Testing World Cup last year. Not only did they win bronze in the world rankings, Mike also went on to use what he learned for his workshop on test strategy at TestBash Brighton.
What’s interesting is that the team had never worked together before – they’re from different parts of the company. They tried to train together a few times, but attempts were disrupted by either apocalyptic Wellington weather or the aftermath of a 7.8 earthquake.
In the end, Mike used both the travel from New Zealand to Berlin and a group sightseeing trip as an exercise. He used it to see who in the group tended to “scout ahead and report back later”, and who tended to stay put until everyone was back. Fundamentally, it allowed Mike to understand how they explored something new as a group, and during the competition, it helped the team to break down and explore the product as they had while sightseeing at Hong Kong airport and in Berlin.
This habit of experimenting to learn is not unique to Cassandra and Mike. Most writers and speakers in software testing find time to explore and try out new ideas and concepts. Likewise, it’s becoming more common for innovative companies to have “10% time”, when teams will try new things and apply learning.
All in all, this has been an insightful experience for both of us and we hope you enjoyed reading about it. More importantly, we hope it’s encouraged you to think about your style of testing, how you plan, the patterns you fall into, and that it’s inspired you to do a little experimentation yourself.
Please share your thoughts and experiences about anything we discussed, in the comments.