
More Data Mining with Weka (1.2: Exploring the Experimenter)




Hello! Welcome back to New Zealand for another
few minutes of More Data Mining with Weka. By the way, I’d just like to thank all of those who did
the first course for their nice comments and feedback. You know, the University of Waikato is just a little
university on the far side of the world, but they listen. They listen when they hear feedback, and they’ve listened to you. As you can see, they’ve put me in a bigger office with more
books and a bigger plant. So, this has been great. They really appreciate the positive feedback
that we’ve had from you for the previous course. Thank you very much indeed.

Today we’re going to look at the Experimenter. As you know, there are four interfaces to Weka: the Explorer, which we looked at in the last course; the Experimenter; and two more. We’re going to look at the Experimenter today and in the next lesson as well. It’s used for things like determining the mean and standard deviation performance of a classification algorithm on a dataset, which you did manually, actually, in the previous course. It’s easy to run several algorithms on several
datasets, and you can find out whether one classifier
is better than another on a particular dataset and whether the difference is statistically
significant or not. You can check the effect of different parameter
settings of an algorithm, and you can actually express the results of
these tests as an ARFF file. So you can sort of do data mining on the results
of data mining experiments, if you like. In the Experimenter, sometimes the computation takes days or even weeks, and it can be distributed over several computers, like all the computers in a lab. That’s quite easy to do with the Experimenter, but we’re not going to be covering that in this course.

When you invoke the Experimenter, you get three panels: the Setup panel, the Run panel, and the Analyse panel. Before we go to those, let me just refresh your memory. This is a slide from Data Mining with Weka,
Lesson 2.3, I think, where we talked about the training set and
the test set. A basic assumption of machine learning is
that these are independent sets produced by independent sampling from an infinite population. In Lesson 2.3 — perhaps if you don’t remember
this you can go back and look at that video from the first course again — we took a dataset, segment-challenge, and the learning algorithm J48, and we used a percentage split method of evaluation. We evaluated it and got a certain figure for
the accuracy. Then we repeated that with different random number seeds and, in fact, got ten different figures for the accuracy. From those we manually computed the sample mean and the variance, and hence the standard deviation. If you can’t remember that, go and refresh your memory.
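Incidentally, if you’d like to reproduce that exercise in code rather than by hand, here is a minimal sketch using Weka’s Java API. It’s only an illustration: the dataset path, the ten repetitions, and the 90%/10% split are my assumptions, and the exact figures will depend on the seeds you choose.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RepeatedSplit {
        public static void main(String[] args) throws Exception {
            // Load the dataset (path assumed; adjust to your copy)
            Instances data = DataSource.read("segment-challenge.arff");
            data.setClassIndex(data.numAttributes() - 1);

            int runs = 10;
            double[] acc = new double[runs];
            for (int seed = 1; seed <= runs; seed++) {
                // Shuffle with a different random seed on each run
                Instances shuffled = new Instances(data);
                shuffled.randomize(new Random(seed));

                // Percentage split: 90% training, 10% test
                int trainSize = (int) Math.round(shuffled.numInstances() * 0.9);
                Instances train = new Instances(shuffled, 0, trainSize);
                Instances test = new Instances(shuffled, trainSize,
                        shuffled.numInstances() - trainSize);

                J48 tree = new J48();
                tree.buildClassifier(train);

                Evaluation eval = new Evaluation(train);
                eval.evaluateModel(tree, test);
                acc[seed - 1] = eval.pctCorrect();
            }

            // Sample mean and standard deviation, as computed by hand in Lesson 2.3
            double mean = 0;
            for (double a : acc) mean += a;
            mean /= runs;
            double var = 0;
            for (double a : acc) var += (a - mean) * (a - mean);
            var /= (runs - 1);
            System.out.printf("mean = %.2f%%, std dev = %.2f%%%n",
                    mean, Math.sqrt(var));
        }
    }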
Also, while we’re at it, let me just remind you about cross-validation. In Lesson 2.5 of Data Mining with Weka we looked at this technique of 10-fold cross-validation, which involves dividing the dataset into ten parts, holding out each part in turn, and averaging the results of the ten runs.
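A single 10-fold cross-validation is just as short in the API; again only a sketch, with the dataset path assumed:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CrossValidate {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("segment-challenge.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Divide into ten parts, hold out each part in turn,
            // and average the results of the ten runs
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.printf("10-fold CV accuracy = %.2f%%%n", eval.pctCorrect());
        }
    }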
Let’s get into the Experimenter. If I just go here and click Experimenter, I get the Setup panel. I’m going to start a new experiment. I’m just going to note that we’ve got 10-fold cross-validation by default, and we’re repeating the experiment ten times by default. I’m going to add a dataset: the segment-challenge dataset, which is here. I’m going to add a machine learning algorithm; I’m going to use J48. You’ve seen this kind of menu before, many, many times; it’s the same as in the Explorer. If I just select J48 and click OK, then I’ve got this dataset and this learning
algorithm. Well, let’s just run it. I’m going to go to the Run panel and click
Start. It’s running. You can see at the bottom here, it’s doing the fifth, sixth, seventh, eighth, ninth, tenth run, because we repeated the whole thing, 10-fold cross-validation, ten times.

Now, if I go to the Analyse panel, it doesn’t show anything yet. I need to analyze the results of the experiment I just did, so I click Experiment, and then I need to perform the test. You can see here that it’s showing for a dataset called
“segment” that we’ve got an average of 95.71% correct using this J48 algorithm. We wanted to look at the standard deviation. If I click Show std. deviations and perform the test again, then I get the standard deviation. So, we’ve effectively done what we did rather
more laboriously in the first course by doing ten individual runs. Over on the slide here, this just summarizes what we’ve done. In the Setup panel, we set things up. In the Run panel, we just clicked Start, and in the Analyse panel, we clicked Experiment, and we selected Show std. deviations and performed the test.

Now, what about those detailed results of the individual
runs? I’m going to go back to the Setup panel here. I’m going to write the results to a CSV file, which we’ll call “Lesson 1.2”. I think I’ll just do a percentage split. I’ll do 90% training, 10% test. I’ve got my dataset and my machine learning
method, so I’ll just go and run. If I look at the CSV file that’s been produced, well, here it is. We repeated the experiment ten times, and these are the ten different runs. For each of these ten runs we’ve got a lot of information. The information that we’re looking
for here is Percent_correct. That’s the percent correct for each of those
ten separate runs. We’ve got all sorts of other stuff here, including, for example, the user time, the elapsed time, and lots and lots of other things. Maybe you should take a look at those yourself. That’s given us the detailed results for each of the ten runs.
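Since the results file is just a spreadsheet, you can also pull out the Percent_correct column programmatically. Here is a minimal sketch; the file name is my assumption (whatever you typed in the Setup panel, plus a .csv extension):

    import java.io.File;
    import weka.core.Instances;
    import weka.core.converters.CSVLoader;

    public class ReadResults {
        public static void main(String[] args) throws Exception {
            // Load the Experimenter's results spreadsheet as a dataset
            CSVLoader loader = new CSVLoader();
            loader.setSource(new File("Lesson 1.2.csv"));
            Instances results = loader.getDataSet();

            // One row per run; pick out the Percent_correct column
            int col = results.attribute("Percent_correct").index();
            for (int i = 0; i < results.numInstances(); i++) {
                System.out.printf("run %d: %.2f%% correct%n",
                        i + 1, results.instance(i).value(col));
            }
        }
    }

This is the “data mining on the results of data mining” idea mentioned earlier: the results are themselves a dataset.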
I’m going to do 10-fold cross-validation now. Those were the ten repetitions, each a single percentage split. If I do 10-fold cross-validation instead, write the results into a file, and run it again, it takes a little bit longer, because it’s doing cross-validation each time. Now it’s finished, and if we look at the resulting file, we get something that’s very similar but much
bigger. We repeated the whole thing, 10-fold cross-validation, ten times. This is the first run, with its ten folds; here are the ten folds of the second run, and so on. I’ve got the same results as I had before along here. I’ve got a very detailed account of what was done in that experiment.
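Building on the previous sketch, you could average the ten fold results within each run yourself. This assumes the run column in the file is named Key_Run and the output file is the one named here; check the header of your own file:

    import java.io.File;
    import weka.core.Instances;
    import weka.core.converters.CSVLoader;

    public class FoldsPerRun {
        public static void main(String[] args) throws Exception {
            CSVLoader loader = new CSVLoader();
            loader.setSource(new File("Lesson 1.2 cv.csv"));  // file name assumed
            Instances results = loader.getDataSet();

            // Column names as found in the file's header (check yours)
            int runCol = results.attribute("Key_Run").index();
            int accCol = results.attribute("Percent_correct").index();

            // Average the ten fold results within each of the ten runs
            double[] sum = new double[10];
            int[] count = new int[10];
            for (int i = 0; i < results.numInstances(); i++) {
                int run = (int) results.instance(i).value(runCol); // runs numbered 1..10
                sum[run - 1] += results.instance(i).value(accCol);
                count[run - 1]++;
            }
            for (int r = 0; r < 10; r++) {
                System.out.printf("run %d: mean over %d folds = %.2f%%%n",
                        r + 1, count[r], sum[r] / count[r]);
            }
        }
    }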
Just coming back to the slides here: to get detailed results we went back to the Setup panel, and selected
CSV file, and put in a file name for the results. This is the file that we got with percentage split. Then we did the same thing for the cross-validation
experiment, and got a larger results spreadsheet.

Let’s just review the Experimenter. We’ve got three panels. In the Setup panel, you can open an experiment, and you can save an experiment, but what we usually do is start a new experiment by clicking New. There’s an Advanced mode. We’re not going to talk about the Advanced
mode here; we’re going to continue to use the simple mode of the Experimenter. You can set a file name for the results if
you want, either an ARFF file or a CSV file or, in fact, a database file. You can do either a cross-validation or a
percentage split. Actually, you can preserve the order in percentage split. The reason for that is that there’s no way
of specifying a separate test file in the Experimenter. To do that, you would kind of glue the training set and
test set together, preserve the order, and specify the appropriate percentage so that
those last instances were used as the test set. Normally we’re not doing that; we just randomize things for the percentage split.
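For comparison, with the Java API a separate test set is straightforward, which is why that trick is only needed inside the Experimenter. A minimal sketch, with the file names assumed (the two files must have identical headers):

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class SeparateTestSet {
        public static void main(String[] args) throws Exception {
            Instances train = DataSource.read("train.arff");  // paths assumed
            Instances test = DataSource.read("test.arff");
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            // Build on the training set, evaluate on the supplied test set
            J48 tree = new J48();
            tree.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);
            System.out.printf("accuracy on supplied test set = %.2f%%%n",
                    eval.pctCorrect());
        }
    }

This is essentially what the Explorer’s Supplied test set option does.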
We’ve got the number of repetitions. We repeated the whole thing ten times, but we could have repeated it a hundred times. Here we can add new datasets, and we can delete datasets that we’ve added. Similarly, we can add new learning algorithms into the learning algorithms box. That’s the Setup panel.

Then there’s the Run panel. You don’t do much in the Run panel except
click Start, and just monitor for errors here. There were zero errors in the three runs I
did.

Then, in the Analyse panel, you can load results from a file or a database, but what we normally want to do is click Experiment
here to get the results from the experiment we’ve just done. There are many options, and we’re going to be looking at some of these
options as we go through this course.

That’s the Experimenter. We’ve learned how to open the Experimenter. We’ve looked at the Setup, Run, and Analyse panels. We’ve evaluated a classifier on a dataset
using both cross-validation repeated ten times and percentage split repeated ten times. We’ve looked at the spreadsheet output. We’ve looked at the Analyse panel. We found out how to get the mean and the standard
deviation, and we’ve looked at some of the other options
on the Setup and Run panels. There’s a chapter in the course text on the
Experimenter, Chapter 13. If you go to the activity now associated with
this lesson, you’ll do some of the things I’ve just been
doing, and more besides. Good luck, and we’ll see you in the next lesson. Bye for now!


6 thoughts on “More Data Mining with Weka (1.2: Exploring the Experimenter)”

  1. samwalrus says:

    Where is the segment dataset available to download from?

  2. LeTon AnhThu says:

    Can I ask a question? How do I convert a .txt file into .arff?

  3. Greg Hovey says:

    Loved your intro music. Thanks for sharing!

  4. lisa liu says:

    I like it very much, especially the summary, but I’m confused about choosing cross-validation: sometimes it’s 9 folds, sometimes 10. Would it be better to explain a little?

  5. Olivia Liu says:

    where can I find the activities that he mentioned in the video?

  6. Vicky Vickys says:

    I loved the intro music; can you tell me where to download it? Very useful software. Thanks for contributing to the world!
