A superpower every dev and data scientist should have

When you start your career in tech, a big part of the work is finding your superpower. At the beginning you might have to try different things to discover what yours is. Some people are super happy working on databases, others like backend stuff, machine learning… you name it. A big part of your first 2-3 years is finding the intersection between what you like and what you are good at.

It’s normal that we look for tech-related superpowers. It’s what we do most of the day. But there is another superpower that can make you stand out. It does not matter whether you are a developer at a big company, founding a startup or somewhere in between. I’ve found that this superpower can change your career and help you grow in some interesting directions.

The superpower is talking to business people, or to other people in general. You might not like it, but it can help a lot. You might need to communicate with them for different reasons: you want to suggest a new feature, you think the team should spend more time on a specific topic, or you need to explain a delay in the development process (I know this never happens :-)).

Understanding how much to talk, which messages to pass on and how to structure them can be key to your professional development. Even more important, it can make you enjoy your job way more. People will enjoy working with you, they’ll look to you for advice and you will enjoy working with them.

You don’t need to be a smooth talker. Just focus on your audience: what are they interested in? What do they need to understand? Don’t use big words. Don’t lecture them on complex Computer Science or Statistics concepts. Understand their metrics and what they are trying to achieve, and align your speech to those.

Saying “we need to implement SAML2” is never the same as saying “single sign-on will make our customers’ lives easier”. Don’t say “we need to revamp our whole backend to improve performance”; say “we will lower our order processing time by an average of X%”. Those little tricks can help you get your features built and better explain what you are trying to achieve.

I’m building a video course on this topic. If you are interested feel free to check it out on gumroad.


Before starting a Data Science project in the company you must…

You’ve probably heard about Data Science, Machine Learning, AI and all those cool buzzwords. You have heard that some of those technologies can help you use your data to predict the future: forecast your demand, avoid fraud and recommend products to customers. It’s great that you are thinking about starting one of these projects.

Most of the time teams start by thinking about the how. They focus on the algorithm and on the kind of problem they are trying to solve: is this a classification, clustering or regression problem? This might bring some undesirable results to the project. The team may optimize the how but forget about the fuel that makes it all run: data.

Before starting you should work on a Data Quality report. It does not need to be very complex, but it needs to answer the following questions:

  • Can I trust this data? Do you know where this data comes from? How was it collected? How much human intervention has it gone through? Is it external or from one of my own systems?
  • Does someone on the team understand this data? Are we sure we know the unit of measurement this data should have? Does the frequency make sense for the kind of problem we are trying to solve? Can somebody tell if a specific data point is off the charts? (Imagine having a negative number in a sales table.)
  • Is the data good? There are lots of frameworks to evaluate this, but you can start with simple questions. Do I have a lot of empty rows? Do I have a lot of repeated data? Are the data types correct for my whole dataset?
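The simple questions in the last bullet can be turned into a first-pass report with a few lines of pandas. This is only a minimal sketch: the function name data_quality_report, the non_negative_columns parameter and the tiny sales table are made up for illustration, and a real report would also need the context questions from the first two bullets, which code cannot answer.

```python
import pandas as pd


def data_quality_report(df: pd.DataFrame, non_negative_columns=()):
    """Summarize basic data quality signals for a DataFrame."""
    report = {
        "rows": len(df),
        # Rows where every column is missing.
        "empty_rows": int(df.isna().all(axis=1).sum()),
        # Share of missing values per column.
        "missing_share": df.isna().mean().round(2).to_dict(),
        # Rows that repeat an earlier row exactly.
        "duplicate_rows": int(df.duplicated().sum()),
        # Column types as pandas inferred them.
        "dtypes": df.dtypes.astype(str).to_dict(),
    }
    # Values that should never be negative, such as sales amounts.
    report["negative_values"] = {
        col: int((df[col] < 0).sum()) for col in non_negative_columns
    }
    return report


# Made-up sales table used only to exercise the checks.
sales = pd.DataFrame(
    {
        "store": ["A", "B", "B", None, "C"],
        "units": [10, 5, 5, None, -3],
    }
)
report = data_quality_report(sales, non_negative_columns=["units"])
for check, result in report.items():
    print(check, result)
```

A report like this does not say whether the data is good enough; it only points at the rows and columns someone who understands the data should look at first.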

Once you are able to answer these three groups of questions you will be in a better position to start your project. During this exercise you might find that you won’t be able to use some of the data, or by asking around you might find out about new data that you hadn’t even thought about.


Detecting soccer teams using unsupervised learning and tensorflow object detection (images and videos)

In the past we have used Tensorflow Object Detection to detect sharks, social distancing and squirrels. Detecting objects is fun and we can build on top of that. Our main task will be to detect the two teams on a soccer field. We will use Tensorflow Object Detection to detect the people and then we’ll use unsupervised learning to cluster the people objects based on their shirt color. We’ll use k-means to cluster the people objects.

We’ll start with the regular Tensorflow Object Detection sample. After that we’ll follow some steps to build our little project.

This will be our end result:

The first thing we’ll need to do is modify the method visualize_boxes_and_labels_on_image_array. This will allow us to use a different bounding box color for each team. Although we need to copy-paste the whole method, the change is pretty small:

        '''
        if agnostic_mode:
          box_to_color_map[box] = 'DarkOrange'
        elif track_ids is not None:
          prime_multipler = _get_multiplier_for_color_randomness()
          box_to_color_map[box] = STANDARD_COLORS[
              (prime_multipler * track_ids[i]) % len(STANDARD_COLORS)]
        else:
          box_to_color_map[box] = STANDARD_COLORS[
              classes[i] % len(STANDARD_COLORS)]
        '''
        box_to_color_map[box] = STANDARD_COLORS[team[i]]
        

We commented out the original color logic and assigned the color based on a team array that contains a different number for each team.

Then we’ll have our main method which will let us detect the teams. At a high level this method performs the following steps:

  • Performs object detection and filters people
  • Processes the coordinates to feed them into the k-means
  • Uses k-means to find clusters
  • Displays the images with the teams detected

import pandas as pd
from sklearn.cluster import KMeans

def detect_team(model, frame, df):
  # The array based representation of the image will be used later in order
  # to prepare the result image with boxes and labels on it.
  person_class = 1  # id of the "person" class in the COCO label map
  original_image = frame
  image_np = frame

  # Actual detection.
  output_dict = run_inference_for_single_image(model, image_np)

  # Keep only the detections labeled as people.
  boolPersons = output_dict['detection_classes'] == person_class
  output_dict['detection_scores'] = output_dict['detection_scores'][boolPersons]
  output_dict['detection_classes'] = output_dict['detection_classes'][boolPersons]
  output_dict['detection_boxes'] = output_dict['detection_boxes'][boolPersons]

  r_points = []
  g_points = []
  b_points = []
  rows = []

  for box in output_dict['detection_boxes']:
    # Boxes are normalized, so convert them to pixels before cropping.
    new_box = denormalize_coordinates(box, original_image.shape[1], original_image.shape[0])
    im2 = original_image[int(new_box[0]):int(new_box[2]), int(new_box[1]):int(new_box[3]), :]
    # Average each color channel of the crop: 3 numbers per person.
    # (Channels 0, 1, 2 are R, G, B or B, G, R depending on how the frame was loaded.)
    r, g, b = im2[:,:,0].mean(), im2[:,:,1].mean(), im2[:,:,2].mean()
    r_points.append(r)
    g_points.append(g)
    b_points.append(b)
    rows.append({'R': r, 'G': g, 'B': b})

  # DataFrame.append was removed in pandas 2.0, so the new rows are concatenated.
  df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)

  #print(df.shape)
  # k-means with 2 clusters needs at least 2 detected people.
  if len(output_dict['detection_boxes']) > 1:
    kmeans = KMeans(n_clusters = 2, init = 'k-means++', max_iter=1000, n_init = 100, random_state=0)
    y_kmeans = kmeans.fit_predict(df)

    visualize_boxes_and_labels_on_image_array(
      image_np,
      output_dict['detection_boxes'],
      output_dict['detection_classes'],
      output_dict['detection_scores'],
      category_index,
      instance_masks=output_dict.get('detection_masks_reframed', None),
      use_normalized_coordinates=True,
      line_thickness=8,
      team = y_kmeans)
    '''
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(r_points, b_points, g_points, c=y_kmeans)
    plt.show()
    '''
  return image_np

Another interesting part is how we apply k-means. Images in numpy are represented as a three-dimensional array (height, width and the red, green and blue channels), so we average each channel and get 3 numbers per person object. We feed those 3 dimensions into k-means and get the clusters.
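To see the color-averaging idea in isolation, here is a small sketch that runs without the detection model. Everything in it is an assumption made for illustration: the synthetic frame, the hand-written boxes, and the helpers denormalize_box and mean_color, which only mimic what the post's denormalize_coordinates call and cropping step appear to do. The boxes use the normalized [ymin, xmin, ymax, xmax] order that TensorFlow Object Detection returns.

```python
import numpy as np
from sklearn.cluster import KMeans


def denormalize_box(box, width, height):
    """Convert a normalized [ymin, xmin, ymax, xmax] box to pixel units."""
    ymin, xmin, ymax, xmax = box
    return [int(round(ymin * height)), int(round(xmin * width)),
            int(round(ymax * height)), int(round(xmax * width))]


def mean_color(image, box):
    """Average each color channel inside the box: 3 numbers per person."""
    ymin, xmin, ymax, xmax = denormalize_box(box, image.shape[1], image.shape[0])
    crop = image[ymin:ymax, xmin:xmax, :]
    return crop.reshape(-1, 3).mean(axis=0)


# Synthetic 100x200 frame: two reddish "players" and two bluish ones.
frame = np.zeros((100, 200, 3), dtype=np.uint8)
frame[10:40, 10:30] = (200, 20, 20)
frame[50:80, 40:60] = (210, 30, 25)
frame[10:40, 120:140] = (20, 20, 200)
frame[50:80, 160:180] = (25, 35, 220)

# Hand-written normalized boxes, one per synthetic player.
boxes = [
    (0.10, 0.05, 0.40, 0.15),
    (0.50, 0.20, 0.80, 0.30),
    (0.10, 0.60, 0.40, 0.70),
    (0.50, 0.80, 0.80, 0.90),
]

# One row of 3 averaged channels per box, then two clusters (the two teams).
features = np.array([mean_color(frame, box) for box in boxes])
teams = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(teams)
```

The cluster numbers themselves carry no meaning; what matters is that players with similar shirt colors end up with the same number. On real footage the crops also include grass and skin, so the averages are noisier than in this toy frame.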

You can also display the k-means visualization by uncommenting these lines:

    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(r_points, b_points, g_points, c=y_kmeans)
    plt.show()

I also added a code snippet that you can use to read a video and generate another video with the detected teams:

import pathlib

import cv2
import pandas as pd
from google.colab.patches import cv2_imshow

FILE_OUTPUT = "test.avi"
PATH_TO_TEST_VIDEO = pathlib.Path('models/research/object_detection/test_images/soccer.avi')

# Empty color table passed to detect_team (assumed to hold the R, G, B columns).
df = pd.DataFrame(columns=['R', 'G', 'B'])

vcap = cv2.VideoCapture(str(PATH_TO_TEST_VIDEO))
frame_width = int(vcap.get(cv2.CAP_PROP_FRAME_WIDTH))
frame_height = int(vcap.get(cv2.CAP_PROP_FRAME_HEIGHT))

out = cv2.VideoWriter(FILE_OUTPUT, cv2.VideoWriter_fourcc('M', 'J', 'P', 'G'),
                     24, (frame_width, frame_height))

# Process every frame until the video ends.
while True:
    ret, frame = vcap.read()
    if not ret:
        break
    im = detect_team(detection_model, frame, df)
    #cv2_imshow(im)
    out.write(im)

vcap.release()
out.release()

Take a look at the video:

You can find the code on this repository.


Object Detection. A shortcut when thinking about labeling images

Detecting objects in an image can be accomplished by using a deep learning model. There are a lot of pre-trained models on the Internet that you can use. Sometimes you might want to train your model to detect a specific object (sharks, squirrels, a mask on a person’s face…). There are multiple tutorials about how to train these models on custom datasets. Don’t, at least not right away.

Before you even think about creating your own dataset, keep in mind that downloading hundreds of images and labeling them using labelimg can take a lot of time. And in some cases, it might be unnecessary. Test some out-of-the-box models before taking the long route. This step won’t take long and can save you a ton of time.

I am going to use a couple of examples. I want to detect certain animals in pictures: sharks and squirrels. Let’s say we have a research purpose to do this. Before taking the long route, let’s try the other approach.

We are going to use the official Google Colab notebook to test the different models. The notebook is pretty straightforward: if you run all the cells, you will use a common model trained on a dataset called COCO. The full name of the model is ssd_mobilenet_v1_coco_2017_11_17. This is a fast model, but it looks like it won’t work for our purposes. I uploaded an image to /content/models/research/object_detection/test_images and this is the result we got:

squirrel mscoco 3.png

Not the results we were expecting. The confidence is low and the prediction is not very good, even if we assume that’s a squirrel.

Before changing some stuff on the code, you can find some pre-trained ready to use models on the Tensorflow detection model zoo. There are models trained on different datasets and with different performances.

Let’s change a couple of lines of code and test again. First we are going to use a model trained on the iNaturalist dataset:

  • In the section “Loading Label Map” we are going to use the following code: PATH_TO_LABELS = 'models/research/object_detection/data/fgvc_2854_classes_label_map.pbtxt' The label map helps us interpret the output of the new model. It ties the category number to a name.
  • In the section “Detection” we are going to use the following code: model_name = 'faster_rcnn_resnet101_fgvc_2018_07_19' This tells the code which model to download.

With the new model these are the results:

download.png

Higher confidence and a weird Latin name (Sciurus carolinensis). If we look it up on Wikipedia we find that its common name is Eastern gray squirrel. Not bad. The same happens if we test a shark image:

whale shark naturalist.png

Using Wikipedia we can find that it’s a whale shark.

Just for the sake of experimenting I used a model trained on the Open Images dataset. I used the following label map and model:

  • model_name = 'faster_rcnn_inception_resnet_v2_atrous_oid_2018_01_28'
  • PATH_TO_LABELS = 'models/research/object_detection/data/oid_bbox_trainable_label_map.pbtxt'

shark open image.png

Here we got a higher confidence but a less accurate detection. Depending on your use case one model or the other could be better.

In some cases you’ll still need a custom dataset and will have to take the long route. But checking this avenue won’t hurt and might save you a lot of precious time.


How to survive multiple interview processes

As Paul Krugman says: These are strange times. We are living under unusual circumstances and there is a ton of negative stuff to think about. A lot of people have been laid off from their jobs and are looking for a new gig.

For software developers the market is interesting: some companies are reducing headcount while others are hiring. Junior and mid-level software developers face an additional challenge: they might fit in different positions depending on the programming languages they know or like. Which one should I study? Should I polish the ones I already know? Should I learn a new one? As always the answer is: it depends. I’ll elaborate in the following paragraphs.

I’ve seen a lot of similarities between finding a new job and selling enterprise software (please don’t run yet, bear with me). In sales we have a pipeline where we manage opportunities and try to maximize our sales numbers. Here you only need 1 win; you only need to be right once. I’ll explain in simple terms what a pipeline is and how you can use it to survive this tough situation.

The pipeline has 5 stages. We have some interesting assumptions and facts:

  • Not all opportunities have to go through the 5 stages.
  • As we advance in the pipeline the probability of winning should increase.
  • Opportunities move through the pipeline at different velocities: some can change really quickly and others can be very slow.

The 5 stages are:

  1. Identify: Here opportunities are like gossip. You heard a company is hiring, or you read a post on LinkedIn. At this point you know there might be an opportunity, but you don’t even know if you have what it takes.
  2. Qualify: To be able to pass to this stage you must assess if the opportunity is a fit for you. Review the seniority level they are looking for, years of experience, programming languages and industry.
  3. Pursue: Once you send your application (CV) or ask a friend to refer you, you are in Pursue. At this point, it might be a good idea to polish your languages or learn new ones.
  4. Closing: Now we are in the interview phase. This can vary a lot from company to company: you can have several phone interviews, you might have an on-site (at least in the past) and you could be interviewed by your future boss, to name a few.
  5. Won: You accepted the offer! Congrats! Hopefully, you’ll be on this stage pretty soon.

After that long explanation, I would say you should study new languages or polish the ones you know based on your pipeline. Doing an Elixir course because of an opportunity in the Identify stage is probably not a great idea (unless you really want to work with Elixir). You could get serious about a new language if you are in Pursue or Closing.

This advice might not apply to everyone and every situation. It’s my 5 cents to try to help in this difficult time we are living through.

Good luck finding your next gig!

Prueba001-05.jpg

30 things I learned at my first job (Daniel Rojas)

 


How to create a practical Open AI custom environment for the rest of us (sourcing problem)

Why create a custom OpenAI Gym environment?

Today we are using classical Machine Learning and Neural Networks to solve all kinds of problems. Reinforcement Learning is being used to solve games and some industrial applications, but I think this will change pretty soon. Instead of tagging images, we will be creating business environments for our agents to learn and perform in real life. I’ve read some tutorials on how to create an environment but the framing is the difficult part.

The sourcing problem

I wanted to build an environment that had to do something with business. All the companies have sourcing departments. People who buy stuff, either to resell it (as e-commerce) or to operate the business (you need pens, cars, Dunder Mifflin paper…).

When you buy stuff you have to make decisions. Imagine that you have 5 suppliers, each with a different price and different reliability. There are very cheap suppliers that are not very reliable and there is a premium one that always delivers on time. You can find the example in the following table:

Screen Shot 2020-04-12 at 11.07.22 AM.png

In this case, paying a low price carries a high risk. Every supplier sells you 5 articles at a time and you only have $1000 to buy as many articles as you can. The best-case scenario would be that you are super lucky and buy always

Framing the sourcing problem as a Reinforcement Learning problem

Before starting to write the code, there are some things we need to define. In my case this takes more time than writing the actual code:

  • Action space: Which actions can your agent take? Here we will have 5 actions, buy from 1 of the 5 suppliers.
  • Observation space: This is what our agent will see. In our case, we will let him know how many articles he currently has, how close he is to the goal (50 articles).
  • Reward: You need to give your agent a prize when he does well. Here we will use the following function: (current_articles / max_articles)^0.9, with max_articles = 50

Screen Shot 2020-04-12 at 11.12.22 AM.png

For me, the trickiest points here are the observation space and the reward function. For the observation space, I read a lot of code from existing environments in the OpenAI Gym documentation. In this table, you can find environments and their types so you can look faster. Choosing the reward function is an art; I watched a great YouTube video that gave me the idea of choosing this function.
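The reward function from the list above is small enough to write down and probe directly. This is a sketch of the formula as stated in the post; the function and parameter names are mine.

```python
MAX_ARTICLES = 50  # goal stated in the post


def reward(current_articles, max_articles=MAX_ARTICLES, exponent=0.9):
    """Reward grows with the share of the goal already bought."""
    return (current_articles / max_articles) ** exponent


# Print the reward at a few points between an empty cart and the goal.
for articles in (0, 5, 25, 50):
    print(articles, round(reward(articles), 3))
```

Because the exponent is below 1, the curve is slightly concave: the reward goes from 0 with no articles to 1 at the goal, and every intermediate state earns a bit more than its plain share of the goal.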

Simple steps to create a custom Open AI Environment

Once you have framed the problem in an RL way, it is all downhill from there.

  • You need to create the file structure for your environment. There is great documentation on OpenAI Gym. Be careful with the names and make sure you replace everything; this can get cumbersome fast.
  • After you’ve done this you need to modify some specific methods:
    • __init__: where you initialize all the variables.
    • step: whenever your agent takes an action, it will call this method. The environment will return a state so the agent can make its next decision.
    • reset: once you’ve reached a terminal state (you are out of money or bought all the articles) you need to reset everything so the agent can try again.
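The three methods above can be sketched as a plain Python class. This is not the code from the repository, and it deliberately does not subclass gym.Env so it runs without extra packages. The supplier prices and reliabilities are hypothetical, because the original table is only available as an image; treating an unreliable supplier as one that keeps the money and delivers nothing is also my assumption. The 5 articles per order, the $1000 budget, the 50-article goal and the reward formula come from the post.

```python
import random


class SourcingEnv:
    """Gym-style sketch: buy 50 articles with a $1000 budget."""

    # Hypothetical (price per order, probability of delivery) per supplier.
    SUPPLIERS = [(40, 0.4), (60, 0.55), (80, 0.7), (100, 0.85), (120, 1.0)]
    BATCH_SIZE = 5
    MAX_ARTICLES = 50
    BUDGET = 1000

    def __init__(self, seed=None):
        # Initialize every variable the episode needs.
        self.rng = random.Random(seed)
        self.articles = 0
        self.money = self.BUDGET

    def reset(self):
        # Start a new episode so the agent can try again.
        self.articles = 0
        self.money = self.BUDGET
        return self.articles

    def step(self, action):
        # The action is the index of the supplier to buy from.
        price, reliability = self.SUPPLIERS[action]
        affordable = price <= self.money
        if affordable:
            self.money -= price
            # Assumption: an unreliable supplier keeps the money and delivers nothing.
            if self.rng.random() < reliability:
                self.articles = min(self.articles + self.BATCH_SIZE, self.MAX_ARTICLES)
        reward = (self.articles / self.MAX_ARTICLES) ** 0.9
        # Terminal states: goal reached, out of money, or an order that cannot be paid.
        out_of_money = self.money < min(p for p, _ in self.SUPPLIERS)
        done = self.articles >= self.MAX_ARTICLES or out_of_money or not affordable
        return self.articles, reward, done, {}


env = SourcingEnv(seed=0)
state = env.reset()
done = False
while not done:
    # Always buy from the premium supplier (index 4).
    state, reward, done, info = env.step(4)
print(state, env.money)
```

The step method returns the classic Gym tuple of observation, reward, done flag and info dictionary. With these made-up prices, buying only from the premium supplier runs out of budget before reaching 50 articles, which is the kind of trade-off the agent has to learn to balance.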

Creating an agent

To test the environment I used Q-learning. Explaining this is outside the scope of this blog post. I used the code from this great tutorial; I recommend watching his videos in case you want to learn more about Reinforcement Learning. You can find the code here.

Results

I ran the agent with different training episodes and plotted how he performed.

With 1 training episode (almost random):

The lowest action is riskier but the cheapest. In this example, it managed to buy 30 articles.

Using 600 training episodes it managed to solve the problem. It was able to buy 50 articles and use an average approach.

Code

You can find the code on this repository.

I’ll be happy to hear about experiments, ideas or questions on Twitter.


Easy and fast path to Video Object Detection (counting sharks)

Video Object Detection is a very interesting problem that could help a lot of people. I found out about it talking to a shark researcher (maybe not his exact title). They have grad students counting sharks in videos from an underwater camera. These videos can be very long and sometimes there are no sharks for hours. I thought about Machine Learning instantly; what could go wrong? I started reading about it and found different approaches.

  • Auto ML Solution (Google, MSFT…): I used these solutions in the past with images, with good results. The con is that these services do not provide video support, at least I was not able to find it.
  • Tensorflow: I watched a ton of videos with examples of the Object Detection API. Be careful with the videos and search for recent ones, because version changes can make a tutorial very hard to follow. I had some trouble trying to train the model with my own images. It might have been a combination of the documentation, my package management and maybe luck. I ended up looking for another way.
  • Tensorflow Object Counting API: I found this repository. It has great examples and it’s built on top of Tensorflow. I still had some problems training with my own images. My only comment would be that this API still lacks the abstraction I wanted to see in an API, at least for the training part.
  • Detecto: I found out about this repository and the first thing I noticed was that it promised the abstraction I was looking for. I managed to train with my own images, and all the different examples are ~5 lines of code. You don’t need to understand Pytorch in order to use it. I was able to run it on Google Colab, and the free GPUs made the training process faster. At some point moving to Tensorflow could make sense, but to start I recommend Detecto.

I need to feed more images to the model, but here is an example of the results:

shark_result.jpeg


Your diverse background can be your main advantage. Supercharge it this quarantine.

I’ve seen a lot of people on Twitter learning and coding during this quarantine, which is great. Learning and doing hard stuff can keep your mind healthy and you won’t get bored. There is something I think we should not miss while spending time at home learning.

Imagine that in the whole industry everybody came from the same background. Everybody majored in Computer Science, watched the same movies, had the same interests and liked to work on the same problems. We were not very far from this some years ago. Now we are in a very different place as an industry: we have a lot of folks coming from different places (education, work experience, interests, culture…), which is what we need to solve the most challenging problems. We are good, but we can be better.

I’ve seen a lot of people studying frontend, backend, ML, Blockchain and similar things. That is good and you need to do it, but don’t forget the other stuff. By that I mean the stuff that makes you unique. As I mentioned at some point, hiring decisions can be very tight and sometimes you want to hire 2 people but just can’t. It’s the small stuff that makes the difference (something very similar happens with promotions and raises). The good news is that the small stuff is hard to copy because it has to do with you.

Here you’ll find a list about “other stuff” and how you can supercharge it this quarantine:

  • Your past work experience: This is a great tool for people transitioning into tech. These experiences can be a great advantage if they align with the role you are applying to. For example, you were a tour guide before jumping into Computer Science and you are now applying for a Software Developer role at Tripadvisor. To supercharge it, learn more about the industries you’ve worked in: take a small MOOC, read some news about them or read a book on that field. We tend to underestimate our knowledge about a certain industry. Trust me, after a couple of years working in a specific industry you’ll have a ton of knowledge, so use it!

 

  • Your interests: Humans tend to be curious about a group of fields and these can be uncorrelated. You might want to learn about anthropology, astronomy, medicine, sports, food or almost anything. Take a small break from coding and learn about a field you are interested in. This will have two effects: you might find relationships between the field and coding, and it could help you in a future interview, either because of the industry you are applying to or to build rapport with the interviewer. Just taking a break from coding can bring great results.

 

  • An action related to coding: You’ve been coding for the past 2 days non-stop, and that’s good. But also keep in mind other related tasks that are not just coding. Try writing a blog post, creating a video tutorial or writing a small book. These are great superpowers that can make you a better developer.

The main goal of this quarantine is to stay safe. It’s great that a lot of people are learning new stuff and hopefully a lot will be able to land new gigs soon. Keep coding (and doing other stuff)!


4 non-tech interview questions that can make you stand out and land the job

The interview process is an art, probably a broken one. I know it has a lot of different factors, some of them maybe even random. But let’s focus on what we can control. You can control what you prepare and how well you perform. For the technical side (algorithms, inverting a binary tree…) there are a ton of good resources you can look for. The 4 questions below tend to be overlooked. These questions are not enough to turn around a bad technical interview, but you can check all the boxes if you ace them (a lot of these processes finish with very tight decisions).

  1. What have you heard about this company? This is a great opportunity for you to show how much you have prepared and how badly you want this job. You don’t want to make it look like this is just another interview in your whole process. Or worse, that this is practice for the interviews you really care about. Here are some ways you can prepare:
    • Start by understanding the sector. You need to know which sector they are in and how they make money. Google can help you, or talk to a friend who knows about the company/sector. Understand what the sector is going through: is it growing or shrinking? Which are the biggest competitors/menaces?
    • Understand the company from 10,000 ft. Are they public or private? How much revenue did they make last year? Any interesting acquisitions lately? Something important in the news? If the company is public, try searching for the last earnings call transcript. This will help a lot and you’ll find opportunities to use your preparation.
    • Understand the company on a local scale. How is this branch doing? When did they open? Are they the new kids on the block? What is their relationship with the mothership?
  2. Tell me about the hardest problem you have solved. This could be considered a technical question but I want you to focus on the storytelling. Set a good context about that last problem that you could not stop thinking about. Be very specific about the problem, the avenues you tried and how you ended up solving it. You need to project grit and show that you never give up. Also, explain how you would approach a similar problem now that you have solved that one. Extra credit if the problem is related to the technologies they are working with.
  3. Tell me something interesting you read lately. The interviewer is trying to find out if you really like what you do. Do you read about stuff in your free time? How do you keep up with all the stuff that is changing? I don’t recommend reading a super complex article just so you can show off. Just try to remember what you have read lately, explain why it was interesting and what you learned. This does not need to be 100% related to the technologies they work on; it could be about another field that is interesting to you (cybersecurity, networking, anything related to tech).
  4. Do you have any questions for me? A lot of people miss this opportunity to show they have prepared and to gain valuable information that could help in the following processes. Ask genuine questions about their job, work-life at that company, how they have been affected by global events (COVID-19 for example) and anything else that you might want to know. Try to come prepared with some questions but keep your eyes open for any question that might come up during the conversation.

An interview should be a conversation. It could happen that these questions are not included, but you can still use different parts of the interview to show what you have prepared. The easiest part is at the end, when they let you ask questions. Use that free-form time to show how you have prepared and why you are the right person for the job.


Which programming language should you learn on this quarantine?

I recently had a conversation with a group of college students about whether they should learn R or Python. This might look like an ML/AI kind of question, although the possible answers apply to different areas of Computer Science. How do you choose between studying React or Angular? Rails or Laravel?

The first idea is that programming languages are like instruments. We use instruments to play music, but you can interpret the same song with different instruments. It’s really hard to choose between programming languages, every language has its pros and cons. If you understand well the fundamental principles (music) it does not matter which instrument you choose.

The second idea is borrowed, inspired by a great tweet from Edouard Harris. If you are starting to code, learn a language that interests you or that is related to one of your interests (building videogames, for example). Once you are more familiar with coding you can learn a language with bigger business applications.

The third idea is to ask what your purpose is for learning a language during the quarantine. Do you just want to have fun? Do you want to land a new gig? Pure intellectual curiosity? Answering this question might be a big percentage of the answer you are looking for.

It does not matter what you learn this quarantine, stay safe and happy coding.
