Project: scikit-learn
Text mining has a wide variety of applications and is used by a growing number of businesses to gather intelligence and provide insight. People constantly post text online via social media, chat rooms and blogs. Tapping into this information can help businesses gain an advantage, and doing so is increasingly a necessary skill in data analytics. Text mining is a distinctive data mining problem: it deals with real-world data that is often heavy on artefacts, difficult to model and challenging to manage properly. It can seem like a dark art that is difficult to learn and to gain traction with. However, some basic strategies can often be applied to get good results quite quickly, and the same basic models appear in many text mining challenges.
The scikit-learn project is a library of machine learning algorithms for the scientific Python stack (NumPy and SciPy). It is known for its detailed documentation, high-quality code and a growing list of users worldwide. The documentation includes tutorials on machine learning as well as on the library itself, and is a great place to start for beginners wanting to learn data analytics. There is a strong focus on reusable components and useful algorithms, and the text mining sections of scikit-learn follow the “standard model” of text mining quite well.
In this presentation, we will introduce the scikit-learn project for machine learning and show how to use it for text mining applications. Real-world data and applications will be used, including spam detection on Twitter, predicting the author of a program and determining a user's political bent based on their social media account.
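To give a flavour of what such an application can look like, here is a minimal sketch of a spam classifier built from scikit-learn's reusable components. It assumes that the "standard model" means bag-of-words features fed into a classifier, which the abstract does not spell out. The tiny corpus, the labels and the choice of `TfidfVectorizer` with `MultinomialNB` are illustrative assumptions, not the data or models used in the talk.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up example messages (an assumption for illustration only,
# not the Twitter data referred to in the abstract).
texts = [
    "win a free prize now click here",
    "free followers click this link now",
    "claim your free gift card today",
    "lunch with the team was great today",
    "reading a good book on the train",
    "looking forward to the conference talks",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Step 1: turn each text into a TF-IDF weighted bag-of-words vector.
# Step 2: fit a multinomial naive Bayes classifier on those vectors.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Classify two unseen messages.
new_messages = [
    "click here for a free prize",
    "great talks at the conference",
]
predictions = model.predict(new_messages)
print(list(zip(new_messages, predictions)))
```

Because the vectoriser and the classifier are separate pipeline steps, either one can be swapped out (for example, a different classifier) without changing the rest of the code.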
Robert is a Research Fellow at the Internet Commerce Security Laboratory (ICSL) at Federation University Australia. Robert's research investigates profiling cyber-attacks, looking for similarities between different attacks. In addition, his research on social media attribution has required a substantial amount of text mining in a very difficult domain. He was awarded Federation University Australia's Young Alumni of the Year for 2014 and is Deputy Director of the Centre for Informatics and Applied Optimisation. Robert is an early career researcher with over 30 publications in international journals and conferences in the areas of data mining, cybercrime and Internet security.
Robert is also a regular contributor to the scikit-learn project and presented at PyCon AU 2013. His contributions include algorithm implementations, substantial documentation updates and reviews of other contributors' pull requests. Robert is also a mentor for one of scikit-learn's Google Summer of Code students in 2014.