scipy.stats Improvements: 2015

Sunday, 23 August 2015

Goodbye GSoC.

Hello all,

This is my final entry for the GSoC. It's been one hell of a ride, nothing short of life changing.

When I started looking for a project, the idea was that even if I didn't get selected to participate, I was going to be able to make my first contribution to OpenSource, which in itself was enough motivation to try it.

I found several interesting projects and decided to apply to the one that I considered best suited to my situation at that moment. Before I got selected I had already interacted with a few members of the community and made a couple of contributions. I was hooked on OpenSource so there was no looking back. By the time I got selected, the GSoC had already met my expectations.

I found a healthy community in SciPy and I could not have asked for a better mentor than Ralf (Gommers). The community members were always involved and supportive while Ralf provided me with enough guidance to understand new concepts in a simple way (I'm no statistician) but not so much that I would be overwhelmed by the information and I still had room to learn by myself, which is an essential part of my learning process (that's where I find the motivation to do the little things).

After starting the GSoC I received the news that I was denied the scholarship to attend Sheffield and my plans for a master's degree were almost derailed. I then got an offer to study at the University of Toronto and this is where it got interesting (spoiler alert: I am writing this blog entry from downtown Toronto).

I went through the scholarship process again and got selected. I also went through the process of selecting my courses at the UofT. With Ralf's guidance and after some research I decided to take courses on Machine Learning, Natural Language Processing and other related topics.

I can now say with pride that I am the newest member of the SciPy community, which will help me on my journey towards becoming a Machine Learning expert or maybe a Data Scientist. That remains to be seen, but we already have some plans for how I can keep contributing to SciPy while getting acquainted with the pandas and NumPy communities. I'd like to see what comes from there.

As you can see, I got a lot more than I had expected from this experience, which I attribute to having approached it with the idea of searching for a passion to turn into a career. Naturally I found it, so now it's time to switch gears.

I would like to use the last paragraph of this rant to give out some thanks. Thanks to Ralf for walking me along to find my own path within the beautiful world of OpenSource and Scientific Computing. Thanks to the SciPy community, especially to Josef Perktold and Evgeni Burovski for providing so much valuable feedback to my PRs. Thanks to Google for organising an event like this, helping people like me with the excuse they need to finally join OpenSource and stop leaving it for later. And of course, thanks to the people in my life that provide me with a reason to wake up and try to be a little better than the day before: My girlfriend, Hélène, who keeps my head above the water when I feel like I forgot how to swim by myself and my parents, whose love and support seem to have no end. You make me feel like I owe it to the world to be the best I can be (or try, at the very least).

Thursday, 6 August 2015

Progress Report

Hello all,

The GSoC is almost over. It's been a great experience so far and if you've been following my blog you know that I have decided to continue my involvement with the community, so this is only getting started.

With that in mind and some support from my mentor (Ralf Gommers), some tasks have taken a backseat while others have gone beyond the original intended scope. The most notable is the NaN policy, which started out as a side note to a simple issue and has become the single largest effort in the project, not just in lines of code but also in community involvement (you can follow the discussion here or the PR here).

NaN policy is now in the bike-shedding phase (reaching consensus on keyword and option names), but it is only the start of a long-term effort that is likely to span months (maybe years, depending on pandas and NumPy).

The NIST test cases for one-way analysis of variance (ANOVA) are also coming along nicely, and once they are done I will continue with the NIST test cases for linear regression.
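For readers curious about what a reference-value test looks like, here is a minimal sketch. The data are made up for illustration (they are not a NIST dataset), and the hand-computed F statistic only stands in for a certified value. `manual_f_statistic` is a hypothetical helper of mine; `scipy.stats.f_oneway` is SciPy's one-way ANOVA function.

```python
import numpy as np
from scipy import stats


def manual_f_statistic(*groups):
    """Compute the one-way ANOVA F statistic from its textbook definition."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()

    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Made-up data, NOT a NIST dataset.
group_a = [1.0, 2.0, 3.0]
group_b = [2.0, 3.0, 4.0]
group_c = [5.0, 6.0, 7.0]

expected_f = manual_f_statistic(group_a, group_b, group_c)
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)

# The library result should match the independent computation.
assert np.isclose(f_stat, expected_f)
```

The real test cases compare against the certified values that NIST publishes with each dataset, which is a stricter check than agreeing with a second implementation.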

Right now there are no major roadblocks but it is worth mentioning that Ralf and I have agreed to move the pencils down date to Aug 18th. This is due to the fact that I have to move to Canada soon to begin my master's degree, and this way I can travel to Toronto on Aug 20th to look for a place to live and also spend some quality time on vacation with my girlfriend Hélène, who has been a great support for me during this transition in my life. I feel like she has earned it just as much as I have.

Classes begin on Sept 16th. Once I feel like I'm settled into the new rhythm, I will get back to work picking up on loose ends or side tasks (like extending NaN policy's coverage) so the project will not suffer. I would also seek Ralf's guidance to start integrating myself into the pandas, NumPy and possibly scikit-learn communities because I plan to steer my career towards data science, machine learning and related fields.

I will need to figure out where my motivation takes me, but this is a challenge that makes me feel excited about the future. GSoC may be almost done, but for me this is only just beginning and I could not be happier. As always, thank you for taking the time to read about my life.

Until next time,
Abraham.

Friday, 24 July 2015

A glimpse into the future (my future, of course)

Hello again,

Before I get started I just want to let you know that in this post I will talk about the future of my career and moving beyond the GSoC so this will only be indirectly related to the summer of code.

As you may or may not know, I will start my MSc in Applied Computing at the University of Toronto in September (2015, in case you're reading this in the future). Well, I have decided to steer towards topics like Machine Learning, Computer Vision and Natural Language Processing.

While I still don't know what I will end up having as my main area of focus nor where this new journey will take me, I am pretty sure it will have to do with Data Science and Python. I am also sure that I will keep contributing to SciPy and most likely start contributing to other related communities like NumPy, pandas and scikit-learn. So you could say that the GSoC has had a positive impact by helping me find areas that make my motivation soar and introducing me to people who have been working in this realm for a very long time and know a ton of stuff that makes me want to pick up a book and learn.

In my latest meeting with Ralf (my mentor), we had a discussion regarding the growing scope of the GSoC project and my concern about dealing with all the unforeseen and ambiguous details that arise along the way. He seemed oddly pleased as I proposed to keep in touch with the project even after the "pencils down" date for the GSoC. He then explained that this is the purpose of the summer of code (to bring together students and organisations) and that their hope when they choose a student to participate is that he/she will become a long-term active member of the community, which is precisely what I would like to do.

I have many thanks to give and there is still a lot of work to be done with the project so I will save the thank you speech for later. For now I just want to say that this has been a great experience and I have already gotten more out of it than I had hoped (which was a lot).

Until my next post,
Abraham.

Progress Report

Hello all,

A lot of stuff has happened in the last couple of weeks. The project is coming along nicely and I am now getting into some of the bulky parts of it.

There is an issue with the way NaN (not a number) checks are handled that spans beyond SciPy. Basically, there is no consensus on how to deal with NaN values when they show up. In statistics they are often assumed to be missing values (e.g. there was a problem when gathering statistical data and the value was lost), but there is also the IEEE NaN which is defined as 'undefined' and can be used to indicate out-of-domain values that may point to a bug in one's code or a similar problem.

Long story short, the outcome of this will largely depend on the way projects like pandas and NumPy decide to deal with it in the future, but right now for SciPy we decided that we should not get in the business of assuming that NaN values signify 'missing' because that is not always the case and it may end up silently hiding bugs, leading to incorrect results without the user's knowledge. Therefore, I am now implementing a backwards compatible API addition that will allow the user to define whether to ignore NaN values (assume they are missing), treat them as undefined, or raise an exception. This is a long-term effort that may span the entire stats module and beyond, so the work I am doing now is set to spearhead future development.
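To make the three behaviours concrete, here is a minimal sketch using only NumPy. The function `mean_with_nan_handling`, the keyword `nan_handling` and the option names `'propagate'`, `'raise'` and `'omit'` are all placeholders of my own, not the agreed API. I am also assuming that propagating NaN is the default, since that keeps existing behaviour unchanged.

```python
import numpy as np


def mean_with_nan_handling(values, nan_handling="propagate"):
    """Hypothetical helper: arithmetic mean with an explicit NaN option."""
    arr = np.asarray(values, dtype=float)
    has_nan = np.isnan(arr).any()

    if nan_handling == "propagate":
        # Treat NaN as 'undefined': any NaN makes the result NaN.
        return arr.mean()
    if nan_handling == "raise":
        # Refuse to guess: make the caller deal with the NaN explicitly.
        if has_nan:
            raise ValueError("The input contains NaN values")
        return arr.mean()
    if nan_handling == "omit":
        # Treat NaN as 'missing': drop it before computing.
        return arr[~np.isnan(arr)].mean()
    raise ValueError("nan_handling must be 'propagate', 'raise' or 'omit'")


data = [1.0, 2.0, np.nan, 3.0]
propagated = mean_with_nan_handling(data)        # result is NaN
omitted = mean_with_nan_handling(data, "omit")   # mean of 1.0, 2.0, 3.0
```

The point of the design is that dropping NaN values only happens when the user asks for it, so a NaN produced by a bug upstream is never silently hidden.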

Another big issue is the consistency of the `scipy.stats` module with its masked arrays counterpart `scipy.mstats`. The implementation will probably not be complicated but it encompasses somewhere around 60 to 80 functions, so I expect it to be a large and time-consuming effort. I expect to work on this for the next month or so.
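As a rough idea of what "consistent" means here, the sketch below checks one function pair. It is only one possible check, not the project's actual test suite, and the data and mask are made up. The assumption is that a masked-array function should give the same answer as its plain counterpart applied to just the unmasked values.

```python
import numpy as np
from scipy import stats
from scipy.stats import mstats

data = np.array([2.0, 8.0, 0.0, 4.0, 1.0, 9.0, 9.0, 0.0])

# Mask two entries (made-up choice) as if they were invalid observations.
mask = np.array([False, False, True, False, False, False, False, True])
masked_data = np.ma.masked_array(data, mask=mask)

# The masked version should agree with the plain version applied
# to only the unmasked values.
plain_result = float(stats.skew(masked_data.compressed()))
masked_result = float(mstats.skew(masked_data))

consistent = np.isclose(plain_result, masked_result)
```

Repeating this kind of comparison across every function pair, including signatures and default arguments, is what makes the task large rather than difficult.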

During the course of the last month or two there have been some major developments in my life that are indirectly related to the project, so I feel like they should be addressed, but I intend to do so in a separate post. For now I bid you farewell and thank you for reading.

Cheers,
Abraham.

Thursday, 2 July 2015

Mid-term summary

Hello all,

We're reaching the halfway mark for the GSoC and it's been a great journey so far.

I have had some off-court issues. I was hesitant to write about them because I don't want my blog to turn into me ranting and complaining, but I have decided to briefly mention them on this occasion because they are relevant and at this point they are all but overcome.

Long story short, I was denied the scholarship that I needed to be able to go to Sheffield so I had to start looking for financing options from scratch. Almost at the same time I was offered a place at the University of Toronto (which was originally my first choice). This is relevant to the GSoC because it coincided with the beginning of the program, so I was forced to cope with not just the summer of code but also with searching/applying for funding and paperwork for the U of T, which combined to make for a lot of work and a tough first month.

I will be honest and say that I got a little worried at around week 3 and week 4 because things didn't seem to be going the way I had foreseen in my proposal to the GSoC. In my previous post I wrote about how I had to make a change to my approach and I knew I had to commit to it so it would eventually pay off.

At this point I am feeling pretty good with the way the project is shaping up. As I mentioned, I had to make some changes, but out of about 40 open issues, only 23 now remain. I have lined up PRs for another 8 and I have started discussion (either with the community or with my mentor) on almost all that remain, including some of the longer ones like NaN handling, which will span the entire scipy.stats module and is likely to become a long-term community effort depending on what road NumPy and pandas take on this matter in the future.

I am happy to look at the things that are still left and find that I at least have a decent idea of what I must do. This was definitely not the case three or four weeks ago and I'm glad about the decision that I made when choosing a community and a project. My mentor is always willing to help me understand unknown concepts and point me in the right direction so that I can learn for myself, and the community is engaging and active, which helps me keep things going.

My girlfriend, Hélène, has also played a major role in helping me keep my motivation when it seems like things amount to more than I can handle.

I realise that this blog (since the first post) has been a lot more about my personal journey than technical details about the project. I do apologise if this is not what you expect, but I reckon that this makes it easier to appreciate for a reader who is not familiar with 'scipy.stats', and if you are familiar with it you probably follow the issues or the developers' mailing list (where I post a weekly update), so technical details would be redundant to you. I also think that the setup of the project, which revolves around solving many issues, makes it too difficult to write about specific details without branching into too many tangents for a reader to enjoy.

If you would like to know more about the technical aspect of the project you can look at the PRs, contact me directly (via a comment here or the SciPy community) or even better, download SciPy and play around with it. If you find something wrong with the statistics module, chances are it's my fault, so feel free to let me know. If you like it, you can thank guys like Ralf Gommers (my mentor), Evgeni Burovski and Josef Perktold (to name just a few of the most active members in 'scipy.stats') for their hard work and support to the community.

I encourage anyone who is interested enough to go here to see my proposal or go here to see currently open tasks to find out more about the project. I will be happy to fill you in on the details if you reach me personally.

Sincerely,
Abraham.

Thursday, 18 June 2015

SciPy and the first few GSoC weeks

Hi all,

We're about three (and a half) weeks into the GSoC and it's been one crazy ride so far. Being my first experience working in OpenSource projects and not being much of an expert in statistics was challenging at first, but I think I might be getting the hang of it now.

First off, for those of you still wondering what I'm actually doing, here is an abridged version of the abstract from my proposal to the GSoC (or you can click here for the full proposal):

"scipy.stats is one of the largest and most heavily used modules in Scipy. [...] it must be ensured that the quality of this module is up to par and [...] there are still some milestones to be reached. [...] Milestones include a number of enhancements and [...] maintenance issues; most of the scope is already outlined and described by the community in the form of open issues or proposed enhancements."

So basically, the bulk of my project consists of working on open issues for the StatisticsCleanup milestone within the statistics module of SciPy (a Python-based OpenSource library for scientific computing). I suppose this is an unusual approach for a GSoC project since it focuses on maintaining and streamlining an already stable module (in preparation for the release of SciPy 1.0), rather than adding a new module or a specific function within.

The unusual approach allows me to make several small contributions and it gives me a wide (although not as deep) scope, rather than a narrow one. This is precisely the reason why I chose it. I feel like I can benefit (and contribute) a lot more this way, while I get acquainted with the OpenSource way and it also helps me to find new personal interests (win-win).

However, there are also some nuances that may be uncommon. During the first few weeks I have discovered that my proposal did not account for the normal life-cycle of issues and PRs in SciPy; my estimates were too optimistic.

One of OpenSource's greatest strengths is the community getting involved in peer reviews; this allows a developer to "in the face of ambiguity, refuse the temptation to guess". If you didn't get that [spoiler alert] it was a reference to the Zen of Python (and if you're still reading this and your name is Hélène, I love you).

The problem with this is that even the smooth PRs can take much longer than one week to be merged because of the back and forth with feedback from the community and code update (if it's a controversial topic, discussions can take months). Originally, I had planned to work on four or five open issues a week, have the PRs merged and then continue with the next four or five issues for the next week but this was too naive so I have had to make some changes.

I spent the last week compiling a list of next steps for pretty much all of the open issues and I am now trying to work on as many as I can at a time, thus minimising the impact of waiting periods between feedback cycles for each PR. I can already feel the snowball effect it is having on the project and on my motivation. I am learning a lot more (and in less time) than before which was the whole idea behind doing the Summer of Code.

I will get back in touch soon. I feel like I have rambled on for too long, so I will stop and let you continue to be awesome and get on with your day.

Cheers,
Abraham.

Sunday, 7 June 2015

My motivation and how I got started

Hello all,

It's been a busy couple of weeks. The GSoC has officially begun and I've been coding away but before I go heavy into details, I think I should give a brief introduction on how I found SciPy and my motivations as well as the reasons why I think I got selected.

The first thing to know is that this is my first time contributing to OpenSource. I had been wanting to get into it for quite a while but I just didn't know where to start. I thought the GSoC was the perfect opportunity. I would have a list of interesting organisations with many sorts of projects and an outline of the requirements to be selected which I could use as a roadmap for my integration with the OpenSource community. Being selected provided an extra motivation and having deadlines was perfect to make sure I stuck to it.

I started searching for a project that was novice friendly, preferably in Python because I'm good at it and I enjoy using it, but of course the project had to be interesting. Long story short, I found in SciPy a healthy and welcoming community so I decided this might be the perfect fit for me.

The first thing I did was try to find an easy-fix issue to get the ball rolling by making my first contribution and allowing one thing to lead to another, which is exactly what happened; before I knew it I was getting familiar with the code, getting involved in discussions and exchanging ideas with some of the most active members of the SciPy community.

In short, what I'm trying to say is: find your motivation, then find something that suits that motivation and get involved, do your homework and start contributing. Become active in the community and things will follow. Even if you don't make it into the GSoC, joining a community is a great learning opportunity.

Cheers,
Abraham.

Tuesday, 5 May 2015

My GSoC experience

Hello all,

My name is Abraham Escalante and I'm a Mexican software engineer. The purpose of this blog is to relate my experiences and motivations to participate in the 2015 Google Summer of Code.

I am not much of a blogger (in fact, this is my first blog entry ever) but if you got here, then chances are you are interested in either the GSoC, the personal experience of a GSoCer or maybe we have a relationship of some sort and you have a personal interest (I'm looking at you Hélène). Either way, I will do my best to walk you through my experience with the hope that this may turn out to be useful for someone in the future, be it to help you get into the GSoC programme or just to get to know me a little better if you find that interesting enough.

I have some catching up to do because this journey started for me several months ago. The list of selected student proposals has already been published (**spoiler alert** I got selected. You can take a look at my proposal here) and the coding period will start in about three weeks' time, but for now I just wanted to write a first entry to get the ball rolling and so you get an idea of what you can expect, should you choose to continue reading these blog entries. I will begin my storytelling soon.

Cheers,

Abraham.