Learn about the growing practice of data mining
How data mining works
Every time you shop, you leave a trail of it behind. The same goes when you surf the web, put on your fitness tracker or apply for credit at your bank. In fact, if we could touch it, we’d be drowning in it. The data we produce every single day, according to IBM, totals an unfathomable 2.5 quintillion bytes (that’s ‘25’ followed by 17 zeros). We’re producing it so fast that it’s estimated 90% of the data in the world right now was created in just the last two years. This ‘Big Data’ is a global resource worth billions of dollars and every business and government wants their hands on it – and for good reason.
Data is the digital history of our everyday lives – our choices, our purchases, who we talk to, where we go and what we do. We’ve previously looked at the Internet of Things (IoT) and how ‘pervasive computing’ will radically change the way we live. You can be sure IoT will lead to an even greater explosion of data generation and capture – thanks to the boom in cloud storage, we’re already storing this stuff as fast as we can go.
But data on its own is pretty worthless – it’s the information we extract from the data that can do everything from forewarning governments of possible terror threats to predicting what you’ll likely buy next time at your local fruit-and-veg. The sheer volume of data available is well beyond human capability alone to decipher and needs computer processing to handle it – that’s where the concept of ‘data mining’ comes in.
Actually, ‘data mining’ is really the buzzword for a fascinating area of computing called ‘machine learning’, which itself is an offshoot of Artificial Intelligence (AI). Here, computers use special code functions or ‘algorithms’ to process the mountains of data and generate or ‘learn’ information from it.
It’s a flourishing area of pioneering research at the moment, which also incorporates mathematical techniques first discovered more than 250 years ago.
In one respect, data mining has a bit of a shadow cast over it, with growing ethical concerns about privacy and how information mined from data is used. But it’s not all ‘terror plots and shopping carts’ – data mining is heavily used by the sciences for everything from weather prediction to medical research, where it’s been used to predict recurrence of breast cancer and find indicators for the onset of diabetes.
Stanford University’s Folding@home disease research project is data mining on a global scale you can get involved in, searching for cures to cancer, Parkinson’s disease and Alzheimer’s.
Essentially, machine learning is about finding patterns in data, learning ‘rules’ that allow us to make decisions and predictions, or finding links or ‘associations’ between factors in situations or applications.
Get the free software
Now you might think machine learning is done in labs with banks of computers, mountains of cloud storage and expensive purpose-built software. You’d be right, but it’s also something you can do at home – what’s more, a decent amount of machine learning software is available free. Popular examples like ‘Hadoop’ or ‘R’ provide powerful frameworks for processing mountains of data, but they can be a little daunting to use first time out. And like Holden versus Ford or Android versus iOS, it’s a field with passionate followers of different software.
One app commonly used for learning the basics is WEKA, developed by New Zealand’s University of Waikato. Like Hadoop, it’s built using the Java programming language, so you can run it on any Windows, Linux or Mac OS X computer. It’s not perfect, but its graphical user interface (GUI) certainly helps.
How machine learning works
Machine learning starts with what’s called a ‘dataset’ representing a situation you want to learn – think of it as a spreadsheet. You have a series of measures or ‘attributes’ in columns, while each row represents an example or ‘instance’ of the thing or ‘concept’ you want to learn.
For example, if we’re looking for indicators of the onset of diabetes, those attributes could include a patient’s body mass index (BMI), their blood-glucose levels and other medical factors. Each instance would contain one patient’s set of attributes. In this situation, the dataset would also have a result or ‘class’ attribute, indicating if the patient developed diabetes or not.
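To make the spreadsheet idea concrete, here is a minimal Python sketch of such a dataset. The attribute names and every patient value are invented purely for illustration; they are not real medical data, and a real dataset would hold many more attributes and instances.

```python
# Attribute names form the columns; 'developed_diabetes' is the class attribute.
ATTRIBUTES = ["bmi", "blood_glucose", "developed_diabetes"]

# Each row is one instance: one patient's set of attributes.
# All values below are made up for illustration only.
dataset = [
    {"bmi": "high", "blood_glucose": "high", "developed_diabetes": "yes"},
    {"bmi": "normal", "blood_glucose": "normal", "developed_diabetes": "no"},
    {"bmi": "high", "blood_glucose": "normal", "developed_diabetes": "no"},
]


def class_values(rows, class_attribute):
    """Return the class value of every instance, in row order."""
    return [row[class_attribute] for row in rows]


print(class_values(dataset, "developed_diabetes"))  # ['yes', 'no', 'no']
```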
If another patient presents for diagnosis and we want to know if they’re at risk of diabetes, machine learning can develop the rules to help predict that likelihood, based on what it learned from the dataset and that person’s measured medical attributes.
What do rules look like?
One of the seriously cool tools we love at TechRadar is IFTTT (If This Then That) – a program that combines social network services to perform linked functions. As the name suggests, it works on the simple ‘if-then’ programming statement that ‘if an event occurs, then go do something’.
Basic rules in machine learning are along the same lines – if an event X occurs, the result is Y. Or it could be a series of events – if X, Y and Z occur, the result is A, or A, B and C.
These rules tell us something about the concept we want to learn. But just as important as what the rules tell us is how accurate they are. Rule accuracy reveals how much confidence we can have in the rules to give us the right result.
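As a rough sketch, an if-then rule and its accuracy might look like this in Python. The attribute `x`, the result labels and the four example instances are all hypothetical, chosen only to show how accuracy is counted.

```python
def rule(instance):
    """A hypothetical if-then rule: if X occurs, the result is Y."""
    if instance["x"]:
        return "Y"
    return "not Y"


def accuracy(rule_fn, instances, class_attribute):
    """Fraction of instances where the rule's result matches the known class."""
    correct = sum(1 for inst in instances if rule_fn(inst) == inst[class_attribute])
    return correct / len(instances)


# Invented examples: the rule gets three of these four right.
examples = [
    {"x": True, "result": "Y"},
    {"x": True, "result": "Y"},
    {"x": False, "result": "not Y"},
    {"x": False, "result": "Y"},
]

print(accuracy(rule, examples, "result"))  # 0.75
```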
Some rules are excellent – they get the right answer every time – while others can be hopeless and some are in between. There are also added complications, such as what’s called ‘overfitting’, where a set of rules works perfectly on the dataset it was learned from, but performs poorly on any new examples or instances given to it. These are all things that machine learning – and the data scientists using it – must consider.
There are dozens of different machine learning functions or ‘algorithms’, many of them fairly complex. But there are two simple examples you can learn quickly called ‘ZeroR’ and ‘OneR’. We’ll use the WEKA app to show them, but also calculate them by hand to see how they work.
The WEKA package includes a number of example datasets, one being a very small ‘weather.nominal’ dataset, containing 14 instances of whether golf is played on a particular day, given a series of weather events at the time. There are five measures or ‘attributes’ – outlook, temperature, humidity, windy and play. This last one is the output or ‘class’ attribute, which says whether golf was played (yes) on that day or not (no).
ZeroR is the world’s simplest data mining algorithm – well, it’s a bit of a stretch to call it an ‘algorithm’ because it’s so simple, but it provides the baseline accuracy level that any proper algorithm will hope to build on.
It works like this: check out the weather data in the image above, look at that ‘play’ class attribute and count up the number of ‘yes’ and ‘no’ values. You should find nine ‘yes’ values and five ‘no’. The proportion of ‘yes’ values is nine out of 14 instances. That means if we get another instance and we want to predict whether golf will be played or not, we can just say ‘yes’ and be right nine times out of 14 or about 64.3% of the time. In other words, ZeroR simply chooses the most popular class attribute value.
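The same count can be sketched in a few lines of Python. The list below is our own transcription of the ‘play’ column from WEKA’s bundled weather.nominal file, and `zero_r` is a name we’ve made up for this sketch; it is not how WEKA implements ZeroR internally.

```python
from collections import Counter

# The 'play' class values of the 14 weather.nominal instances, in file order
# (transcribed by hand from the dataset shipped with WEKA).
play = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]


def zero_r(class_values):
    """Return the most common class value and the share of instances it covers."""
    value, count = Counter(class_values).most_common(1)[0]
    return value, count / len(class_values)


prediction, baseline = zero_r(play)
print(prediction, f"{baseline:.4%}")  # yes 64.2857%
```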
You can test this out in WEKA – make sure you have the Java Runtime Environment (JRE) installed on your PC, then download WEKA, install it and launch the app. Click on the ‘Explorer’ icon to launch the learning window. WEKA uses a modified CSV (comma-separated values) format called ARFF and you’ll find example datasets in the /program files/weka-3-x/data subfolder. In the Explorer window, click on the Open File button and choose the ‘weather.nominal’ dataset. Next, click on the Classify tab and ‘ZeroR’ should already be shown in the Classifier textbox next to the Choose button. Click on the radio button next to ‘Use training set’ under ‘Test Options’ on that left-side control panel and finally, press the Start button.
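If you’re curious what ‘modified CSV’ means in practice, an ARFF file is essentially a block of `@attribute` declarations followed by comma-separated rows under `@data`. The sketch below parses a cut-down, hand-written excerpt in the style of the weather file (not the real file), and `parse_arff` is a deliberately simplified helper of our own: it assumes nominal attributes only and ignores quoting, sparse rows and the other features of the full format.

```python
# A cut-down, hand-written excerpt in ARFF style (not the full WEKA file).
ARFF_TEXT = """\
% lines starting with a percent sign are comments
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,FALSE,no
overcast,FALSE,yes
"""


def parse_arff(text):
    """Parse a very simple, nominal-only ARFF text into (names, rows)."""
    names, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if lower.startswith("@attribute"):
            names.append(line.split()[1])  # second token is the attribute name
        elif lower.startswith("@data"):
            in_data = True  # everything after this is CSV-style rows
        elif in_data:
            values = [v.strip() for v in line.split(",")]
            rows.append(dict(zip(names, values)))
    return names, rows


names, rows = parse_arff(ARFF_TEXT)
print(names)
print(rows[1])
```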
Almost instantly, you’ll get the results in the Classifier Output window. Scroll down and you’ll see ZeroR defaults to choosing the ‘yes’ class value and, further down, ‘Correctly Classified Instances’ showing ‘9’ and ‘64.2857%’ next to it. Bottom line, WEKA has just done the same thing we did before – it counted up the ‘yes’ and ‘no’ class values and chose the most common.
One rule to rule them all
ZeroR gives us a 64.3% base-level learning accuracy in this example, but it’d be nice to do a bit better than that. That’s where the OneR algorithm comes in. It’s called a ‘classification rule learner’, in that, given what it learns from a training dataset, it generates rules that allow us to determine or ‘classify’ the result of a future instance.
If you look at the OneR table above, you can see how it works – each weather dataset attribute has a small number of possible values. For Outlook, they are ‘sunny’, ‘overcast’ and ‘rainy’. For temperature, it’s ‘hot’, ‘mild’ and ‘cool’ and so on. We create a separate list for each attribute value and then count how many times each value occurs in an instance, noting the number of ‘yes’ and ‘no’ results we get.
For example, going through the 14 instances, you can see there are five instances where the outlook is sunny, giving us two ‘yes’ and three ‘no’ results. Likewise, ‘outlook = overcast’ gets four ‘yes’ votes and zero ‘no’ results. We then do likewise for all of the other attributes.
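Here is a small Python sketch of that tally for the Outlook attribute. The (outlook, play) pairs are our own transcription from WEKA’s bundled weather.nominal file, and `tally` is just an illustrative helper name.

```python
from collections import Counter, defaultdict

# (outlook, play) for each of the 14 instances, transcribed in file order.
pairs = [
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rainy", "yes"),
    ("rainy", "yes"), ("rainy", "no"), ("overcast", "yes"), ("sunny", "no"),
    ("sunny", "yes"), ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
    ("overcast", "yes"), ("rainy", "no"),
]


def tally(value_class_pairs):
    """Count 'yes' and 'no' results separately for each attribute value."""
    counts = defaultdict(Counter)
    for value, cls in value_class_pairs:
        counts[value][cls] += 1
    return counts


counts = tally(pairs)
for value in ("sunny", "overcast", "rainy"):
    print(value, "yes:", counts[value]["yes"], "no:", counts[value]["no"])
# sunny yes: 2 no: 3
# overcast yes: 4 no: 0
# rainy yes: 3 no: 2
```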
Next, we count up the errors – these are the smaller counts for each attribute value, so again, for ‘outlook = sunny’, the ‘yes’ count is only two, for ‘outlook = overcast’, the ‘no’ count is zero, for ‘outlook = rainy’, it’s two and so on. The red boxes on the table show the most popular class values for each attribute value and it’s from these that we make our first set of ‘Outlook’ rules:
Outlook = sunny -> Play = no
Outlook = overcast -> Play = yes
Outlook = rainy -> Play = yes
What we’re doing is taking the most popular class value for each attribute value and assigning it to that attribute-value pair to make a rule, so for this example, outlook being ‘sunny’ leads to play being ‘no’ and so on. Next, we repeat this for each of the other three attributes. After that, we add up those ‘error’ counts for each attribute value, so Outlook is 2 + 0 + 2, totalling 4 out of 14 (4/14). For Temperature, we get 5/14, with 4/14 for Humidity and 5/14 for Windy.
Now, we choose the attribute with the smallest error count. Since in this example we have two attributes with an error count of 4 out of 14 (Outlook and Humidity), you can choose either – we’ve gone with the first one, the ‘Outlook’ attribute rule set above.
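The error totals and the final choice can be checked with a short Python sketch. The 14 rows are our own transcription of WEKA’s bundled weather.nominal file, and the helper name `attribute_errors` is made up for this example. Ties are broken by taking the first attribute with the lowest total, which mirrors the choice made here but is otherwise an arbitrary assumption.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]

# Each row: outlook, temperature, humidity, windy, play (the class).
ROWS = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]


def attribute_errors(rows, index):
    """Total errors if each value of one attribute predicts its majority class."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[index]][row[-1]] += 1
    # For each value, every instance outside the majority class is an error.
    return sum(sum(c.values()) - max(c.values()) for c in counts.values())


errors = {name: attribute_errors(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
print(errors)  # {'outlook': 4, 'temperature': 5, 'humidity': 4, 'windy': 5}

# min() returns the first attribute with the lowest total, so Outlook wins the tie.
best = min(errors, key=errors.get)
print(best)  # outlook
```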
This now becomes our ‘OneR’ (one-rule) classification rule set. Using this rule on the training dataset, it correctly predicts 10 out of 14 instances or about 71.4%. Remember, ZeroR gave us 64.3%, so OneR gains us greater accuracy, which is what we want.
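As a sanity check, this sketch learns the majority-class rule for each Outlook value and scores it against the same 14 instances. The pairs are again our own transcription from weather.nominal, and `learn_rules` is an illustrative name rather than anything from WEKA. Tallying directly from the rows gives ‘yes’ for a rainy outlook (three ‘yes’ against two ‘no’).

```python
from collections import Counter, defaultdict

# (outlook, play) for each of the 14 instances, transcribed in file order.
PAIRS = [
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rainy", "yes"),
    ("rainy", "yes"), ("rainy", "no"), ("overcast", "yes"), ("sunny", "no"),
    ("sunny", "yes"), ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
    ("overcast", "yes"), ("rainy", "no"),
]


def learn_rules(pairs):
    """Map each attribute value to its most common class value."""
    counts = defaultdict(Counter)
    for value, cls in pairs:
        counts[value][cls] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


rules = learn_rules(PAIRS)
correct = sum(1 for value, cls in PAIRS if rules[value] == cls)

print(rules)  # {'sunny': 'no', 'overcast': 'yes', 'rainy': 'yes'}
print(correct, "of", len(PAIRS), f"{correct / len(PAIRS):.4%}")  # 10 of 14 71.4286%
```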
Using the new rule
Let’s say we’re given a new instance – the outlook is rainy, temperature is mild, humidity is high and windy is false. What is ‘play’ – will golf be played or not? OneR only looks at the outlook, and the most popular class value for a rainy outlook is ‘yes’ (three of the five rainy days in the dataset saw play), so that’s our answer – for this example, it’s likely (about 71.4%, going by the rule set’s accuracy on the training data) that golf is on today.
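In code, applying the rule set is just a lookup on the one attribute OneR kept. The dictionary below hard-codes the majority class for each outlook value as tallied from the weather.nominal rows; `classify` is a hypothetical helper name.

```python
# Majority 'play' value for each outlook value in the 14 training instances.
OUTLOOK_RULES = {"sunny": "no", "overcast": "yes", "rainy": "yes"}


def classify(instance):
    """Predict 'play' from the outlook alone; other attributes are ignored."""
    return OUTLOOK_RULES[instance["outlook"]]


new_instance = {
    "outlook": "rainy",
    "temperature": "mild",
    "humidity": "high",
    "windy": "FALSE",
}

print(classify(new_instance))  # yes
```

Note that temperature, humidity and windy play no part in the answer, which is exactly the limitation that more sophisticated algorithms try to overcome.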
Run the OneR classification test in WEKA by clicking the Choose button and selecting ‘OneR’ from the ‘Rules’ list. Press the Start button and you’ll see the same list of rules, the number of correctly classified instances at ten and a percentage of 71.4286. That’s exactly what we calculated before.
Tip of the iceberg
Sure, we’re not going to make millions or save the world by predicting which days golf will be played based on weather events, but if you’re a meteorologist determining if current weather conditions could lead to a massive hailstorm, data mining techniques (admittedly more complicated than we’ve seen here) can help with those answers.
Machine learning is a boom area of computer research around the world, aiming to make sense of the ‘death by data’ overload of information surrounding us. We’ve barely scratched the surface here, but next time you surf the internet or go shopping, you’ll hopefully have a better idea of what happens to the data we generate.