scipy.stats Improvements: July 2015

Friday, 24 July 2015

A glimpse into the future (my future, of course)

Hello again,

Before I get started, I just want to let you know that in this post I will talk about the future of my career and moving beyond the GSoC, so it will only be indirectly related to the Summer of Code.

As you may or may not know, I will start my MSc in Applied Computing at the University of Toronto in September (2015, in case you're reading this in the future). Well, I have decided to steer towards topics like Machine Learning, Computer Vision and Natural Language Processing.

While I still don't know what I will end up having as my main area of focus, nor where this new journey will take me, I am pretty sure it will have to do with Data Science and Python. I am also sure that I will keep contributing to SciPy and most likely start contributing to other related communities like NumPy, pandas and scikit-learn. So you could say that the GSoC has had a positive impact: it has helped me find areas that make my motivation soar, and it has introduced me to people who have been working in this realm for a very long time and know a ton of stuff that makes me want to pick up a book and learn.

In my latest meeting with Ralf (my mentor), we discussed the growing scope of the GSoC project and my concern about dealing with all the unforeseen and ambiguous details that arise along the way. He seemed oddly pleased when I proposed to keep in touch with the project even after the "pencils down" date for the GSoC. He then explained that this is the purpose of the Summer of Code (to bring together students and organisations), and that their hope when they choose a student is that he/she will become a long-term active member of the community, which is precisely what I would like to do.

I have many thanks to give and there is still a lot of work to be done on the project, so I will save the thank-you speech for later. For now I just want to say that this has been a great experience and I have already gotten more out of it than I had hoped (which was a lot).

Until my next post,
Abraham.

Progress Report

Hello all,

A lot of stuff has happened in the last couple of weeks. The project is coming along nicely and I am now getting into some of the bulky parts of it.

There is an issue with the way NaN (not a number) checks are handled that extends beyond SciPy. Basically, there is no consensus on how to deal with NaN values when they show up. In statistics they are often assumed to be missing values (e.g. there was a problem when gathering statistical data and the value was lost), but there is also the IEEE NaN, which is defined as 'undefined' and can be used to indicate out-of-domain values that may point to a bug in one's code or a similar problem.
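To make the IEEE side of this concrete, here is a minimal sketch in plain Python (no SciPy involved) of how a NaN behaves and how it silently spreads through a computation. The helper `naive_mean` is a made-up name for illustration only.

```python
import math

nan = float("nan")

# An IEEE NaN is not equal to anything, not even to itself.
is_self_equal = (nan == nan)

# NaN propagates through arithmetic: one NaN poisons the whole sum.
total = sum([1.0, 2.0, nan])

# An out-of-domain operation can produce NaN without raising anything.
out_of_domain = float("inf") - float("inf")


def naive_mean(values):
    """Hypothetical helper: a mean that does no NaN checking at all."""
    return sum(values) / len(values)


# If the NaN below came from a bug upstream, the only hint is a NaN result.
result = naive_mean([1.0, 2.0, nan, 4.0])

print(is_self_equal, math.isnan(total), math.isnan(out_of_domain), math.isnan(result))
```

Whether that last NaN means "one observation is missing" or "something went wrong earlier" cannot be decided from the value alone, which is exactly the ambiguity described above.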

Long story short, the outcome of this will largely depend on the way projects like pandas and NumPy decide to deal with it in the future. Right now, for SciPy, we decided that we should not get into the business of assuming that NaN values signify 'missing', because that is not always the case and it may end up silently hiding bugs, leading to incorrect results without the user's knowledge. Therefore, I am now implementing a backwards-compatible API addition that will allow the user to choose whether to ignore NaN values (assume they are missing), treat them as undefined, or raise an exception. This is a long-term effort that may span the entire stats module and beyond, so the work I am doing now is meant to spearhead future development.
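As a rough illustration of the three choices, here is a minimal sketch in plain Python. It is not the code going into SciPy: the function `mean_with_nan_policy`, the keyword `nan_policy` and the option names `"propagate"`, `"omit"` and `"raise"` are all assumptions made for this example.

```python
import math


def mean_with_nan_policy(values, nan_policy="propagate"):
    """Hypothetical helper: arithmetic mean with an explicit NaN policy.

    "propagate": treat NaN as undefined, so the result is NaN.
    "omit":      treat NaN as missing and ignore it.
    "raise":     refuse to compute and raise ValueError.
    """
    if nan_policy not in ("propagate", "omit", "raise"):
        raise ValueError("nan_policy must be 'propagate', 'omit' or 'raise'")

    values = list(values)
    if any(math.isnan(v) for v in values):
        if nan_policy == "raise":
            raise ValueError("the input contains NaN values")
        if nan_policy == "propagate":
            return float("nan")
        # nan_policy == "omit": drop the NaN entries and carry on.
        values = [v for v in values if not math.isnan(v)]

    if not values:
        # Nothing left to average, so the mean is undefined.
        return float("nan")
    return sum(values) / len(values)


data = [1.0, float("nan"), 3.0]
print(mean_with_nan_policy(data))                     # NaN is passed through
print(mean_with_nan_policy(data, nan_policy="omit"))  # NaN is ignored
```

Defaulting to the "undefined" behaviour in this sketch keeps existing calls working unchanged, which is the sense in which such an addition can be backwards compatible: nobody gets the "missing value" interpretation unless they ask for it.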

Another big issue is the consistency of the `scipy.stats` module with its masked-array counterpart, `scipy.stats.mstats`. The implementation will probably not be complicated, but it covers somewhere around 60 to 80 functions, so it is likely to be a large and time-consuming effort. I expect to work on this for the next month or so.

Over the last month or two there have been some major developments in my life that are indirectly related to the project, so I feel they should be addressed, but I intend to do so in a separate post. For now I bid you farewell and thank you for reading.

Cheers,
Abraham.

Thursday, 2 July 2015

Mid-term summary

Hello all,

We're reaching the halfway mark for the GSoC and it's been a great journey so far.

I have had some off-court issues. I was hesitant to write about them because I don't want my blog to turn into me ranting and complaining, but I have decided to briefly mention them on this occasion because they are relevant and at this point they are all but overcome.

Long story short, I was denied the scholarship that I needed to be able to go to Sheffield, so I had to start looking for financing options from scratch. Almost at the same time I was offered a place at the University of Toronto (which was originally my first choice). This is relevant to the GSoC because it coincided with the beginning of the program, so I had to cope not just with the Summer of Code but also with searching and applying for funding and with paperwork for the U of T, which combined to make for a lot of work and a tough first month.

I will be honest and say that I got a little worried around weeks 3 and 4 because things didn't seem to be going the way I had foreseen in my proposal to the GSoC. In my previous post I wrote about how I had to change my approach, and I knew I had to commit to it so that it would eventually pay off.

At this point I am feeling pretty good about the way the project is shaping up. As I mentioned, I had to make some changes, but out of about 40 open issues, only 23 now remain. I have lined up PRs for another 8 and I have started discussion (either with the community or with my mentor) on almost all that remain, including some of the longer ones like NaN handling, which will span the entire scipy.stats module and is likely to become a long-term community effort depending on what road NumPy and pandas take on this matter in the future.

I am happy to look at the things that are still left and find that I at least have a decent idea of what I must do. This was definitely not the case three or four weeks ago, and I'm glad about the decision I made when choosing a community and a project. My mentor is always willing to help me understand unknown concepts and point me in the right direction so that I can learn for myself, and the community is engaging and active, which helps me keep things going.

My girlfriend, Hélène, has also played a major role in helping me keep my motivation when it seems like things amount to more than I can handle.

I realise that this blog (since the first post) has been a lot more about my personal journey than about technical details of the project. I do apologise if this is not what you expected, but I reckon that this makes it easier to appreciate for a reader who is not familiar with 'scipy.stats'. If you are familiar with it, you probably follow the issues or the developers' mailing list (where I post a weekly update), so technical details would be redundant to you. I also think that the setup of the project, which revolves around solving many separate issues, makes it too difficult to write about specific details without branching into too many tangents for a reader to enjoy.

If you would like to know more about the technical aspect of the project you can look at the PRs, contact me directly (via a comment here or the SciPy community) or, even better, download SciPy and play around with it. If you find something wrong with the statistics module, chances are it's my fault, so feel free to let me know. If you like it, you can thank guys like Ralf Gommers (my mentor), Evgeni Burovski and Josef Perktold (to name just a few of the most active members in 'scipy.stats') for their hard work and support to the community.

I encourage anyone who is interested enough to go here to see my proposal or go here to see currently open tasks to find out more about the project. I will be happy to fill you in on the details if you reach me personally.

Sincerely,
Abraham.