Join the Beautiful JavaScript Challenge!

Beautiful JavaScript Challenge

Check out the Beautiful JavaScript challenge on CodeSignal! You must submit your solution by noon on July 26. See the Official Rules at the end of this post for more details.

At CodeSignal, we heart JavaScript. A lot. In fact, we love it so much that our whole stack, front-end and back-end, is JavaScript. Node + React + Meteor FTW!

We know we’re not the only ones who feel this way, and we want to share the JavaScript love. That’s why we’re launching our first-ever Beautiful JavaScript challenge today!

Think of this challenge as a beauty pageant for code. Write your solution in the most elegant JavaScript you can, polish it up, then click Submit. Our panel of judges, the senior engineers here at CodeSignal, will be looking for code that’s expressive, readable, and concise. They’re going to choose the three most beautiful solutions to this challenge as the winners!

If you win, you’ll receive an all-inclusive trip to San Francisco, including a round-trip ticket from any city in the U.S. and three nights at a hotel downtown. And of course you’ll get to come hang out with us at CodeSignal HQ and nerd out about all things JavaScript and CodeSignal!

Excited yet? Yeah, so are we! We can’t wait to see the great solutions that you come up with. Join the Beautiful JavaScript challenge on CodeSignal!

Official rules for Beautiful JavaScript:

  • All CodeFighters can solve this challenge, but only people in the U.S. are eligible to win.
  • You must write your solution in JavaScript, obviously!
  • Your solution must be submitted before the contest closes at 12:00 PM Pacific Time on Wednesday, July 26.
  • The three winners will be notified by email and their names will be posted on the CodeSignal forum by 5:00 PM Pacific Time on Friday, July 28.

CodeSignal Solves It, Interview Practice Edition: groupsOfAnagrams

This week’s featured Interview Practice challenge was groupsOfAnagrams. It’s a Twitter interview question, and they’re known for their difficult technical interviews! If you haven’t solved it yet, go ahead and do that first, then head back here and we’ll discuss how to get to a solution with a good runtime.

…Done? Okay, let’s dig into this coding challenge!

The technical interview challenge

You are given a list of n words. Two words are grouped together if they are anagrams of each other. For this interview question, our task is to determine how many groups of anagrams there are.

For example, if the list words = [listen, admirer, tea, eta, eat, ate, silent, tinsel, enlist, codesignal, married], we should have groupsOfAnagrams(words) return 4 because there are 4 groups:

  • listen, silent, tinsel, enlist are all anagrams of each other.

  • tea, eta, eat, ate are all anagrams of each other.

  • married and admirer are anagrams of each other.

  • codesignal is not an anagram of anything in words except itself.

In the challenge, we are guaranteed that each word only consists of lowercase letters. So we don’t need to worry about the rules regarding spaces and punctuation. Each word is guaranteed to be at most 10 letters. The hard part is that we have to do this on lists containing up to 10^5 words!

First attempt: Determining if two words are anagrams

There are a few standard ways to determine whether or not words are anagrams of each other. One way is to go through each letter of the alphabet, and check that it appears the same number of times in each word.

alphabet = string.lowercase

def areAnagrams(word1, word2):
    for char in alphabet:
        if word1.count(c) != word2.count(c):
            return False
    return True

What is the runtime of this algorithm? Suppose a typical word has L characters. In calling word.count(char), we are asking the computer to scan word to count the number of times character char appears. This is an O(L) operation. Technically, this makes areAnagrams an O(L) algorithm because we do this count operation for each letter of the alphabet, which doesn’t depend on the size of the word. However, big O notation suppresses the really bad constants hiding in the runtime! Consider that our words have at most 10 letters, and some may be repeats. That means that of our 26 iterations of the loop, at least 16 of them are wasted looking for letters that don’t appear. If we call the number of letters in the alphabet A, then the runtime of this algorithm is O(AL).

Interview Tip:

Asymptotic, or “big O”, notation misses constants. Analysis such as treating the size of the alphabet as a parameter is useful to estimate the size of the constants. This is particularly true in this case, where L is guaranteed to be less than the size of the alphabet A.

If the interviewer questions why you’re doing this (after all, English is almost certainly going to have 26 lowercase letters for our lifetimes!) you have the chance to impress them by saying that you’re considering what happens if punctuation is allowed, or if the company decides to look at large unicode character sets. This shows that you’re forward thinking, and have design on your mind – two things that it’s important to emphasize in technical interviews.

Basically, if using a constraint in the problem helps you significantly reduce the problem, feel free to use it. If you need time while you’re working on your solution, or are doing something odd like considering the number of letters in the alphabet as a variable, you can always defend your move as considering the design implications of the constraint!

We can reduce this runtime of our solution significantly by going through each letter in word1 one at a time. This is at most 10 letters, rather than 26. Here’s one way of implementing it:

# A faster anagram checker
def areAnagrams(word1, word2):
    cntLetters = {}
    # Count occurrences in word1 as positive
    for char in word1:
        cntLetters[char] = cntLetters.get(char,0) + 1
    # Remove occurrences in word2
    for char in word2:
        if char not in cntLetters:
            return False
        cntLetters[char] = cntLetters[char] - 1

    # cntLetters should only contain 0s, because
    # each letter should occur the same number of
    # times.
    for diff_in_count in cntLetters.values():
        if diff_in_count != 0:
            return False

    # All the letters occurred the same number of times
    return True

While the code is longer, and it’s still O(L), it is no longer O(AL). Note that we are no longer assuming the strings are only made of lowercase letters!

Approach 1: Works, but too slow…

We have an anagram checker that will tell us if any pair of words are anagrams.

# This is too slow :(
# METHOD A
def groupsOfAnagrams(words):
    # make a copy of words, so user doesn't
    # lose their words
    tmp_words = words[:]
    groups = 0

    while len(tmp_words) > 0:
        # If the word survived this far, it isn't part
        # of any existing group
        new_word = tmp_words.pop()
        groups += 1

        # remove all words that are anagrams of new word
        tmp_words = [w for w in tmp_words if areAnagrams(w,new_word) == False]
    return groups

What is the runtime here? We run through the loop O(N) times, where N is the number of words. If we are lucky, we remove a lot of words early. However, if each word is not an anagram of any other word, the while loop runs exactly N times. Inside the loop, we scan the remaining words for anagrams. So this is an O(N^2) algorithm. Since we can have up to 10^5 words in a list, O(N^2) probably won’t work. Sure enough, this code doesn’t pass the tests on CodeSignal – and it’s not an answer that will knock the interviewer’s socks off during a technical interview. We can do better!

Approach 2: Precomputing invariants

We are going to find an invariant that all the words that are anagrams share with each other, but don’t share with any words they are not anagrams of. We will use these as keys in hash tables with dummy values. At the end, we will simply count the number of keys in the hash table.

What are the invariants that we can use? A simple invariant is the number of as, the number of bs, …, etc. Any two words that are anagrams will have the same invariant, and any two words that aren’t anagrams will have different invariants.

# METHOD B
alphabet = string.lowercase
# Our first invariant is a tuple of tuples with the number
# of letters that make an appearance, e.g.
# word = "worddd" --> (('d',3), ('o',1),('r',1'),('w',1))
def calcInvariant(word):
    cntLetter = [0]*len(alphabet)
    for letter in word:
        # Add one more occurence of letter
        cntLetter[ord(letter) - ord('a')] += 1

    invariant = tuple([(letter, cntLetter[index]) for index,letter in enumerate(alphabet) if cntLetter[index] > 0])
    return invariant

For each word, this is an O(AL) process, as we are iterating through both the letters in the word and in the alphabet. The gain is going to occur because we only calculate it once per word.

Our algorithm is then:

# METHOD B & C
def groupsOfAnagrams(words):
    hash_of_invariants = {}
    for w in words:
        invariant = calcInvariant(w)
        # we don't actually care about the value,
        # instead we just care how many distinct
        # invariants there are.
        hash_of_invariants[invariant] = 1

    # return the number of invariants
    return len(hash_of_invariants)

Because calcInvariant is O(AL), where L is the number of letters in the word, we see this algorithm has time complexity O(NAL). Note that we never compare invariants. Instead, we use the magic of hash tables to set the value of 1 to the appropriate entry. If that key already exists, we overwrite the value. If it doesn’t exist, a new value is created. It was the comparison between keys that gave us our problematic O(N^2).

A cleaner, faster, invariant

There is another natural invariant for the words: sorting the letters that occur in the words. We could use:

# METHOD C
# calcInvariant('eat') = 'aet'
# calcInvariant('ate') = 'aet'
# calcInvariant('apple')='aelpp'
def calcInvariant(word):
    return tuple(sorted(word))

This has the advantage of being easy to understand, and not making any assumptions about the underlying algorithm. It is a sorting algorithm, so it is O(L log L) instead of O(AL). (Technically we can’t compare algorithms using big O notation like this, because we don’t know the constants involved.) The sorted string can then be used for the invariant. In fact, the code for groupsOfAnagrams doesn’t have to change at all!

This is faster because our sorting algorithm on small words is much faster than our O(AL) dictionary. If the words are long (e.g. we are comparing novels) then our compression of the string to count the number of times the characters are repeated would win. When accessing the hash table hash_of_invariants we need to use the hash function, which depends on the length of the keys. Since L < 10, we know that each key is at most 10 characters long, so the sorting invariant is elegant and understandable.

Dictionaries where we don’t care about the value: Sets

In our hash table, we didn’t really care about the values assigned to the keys. We were only interested in the number of distinct keys we had, as that told us how many groups of anagrams we had.

There is nothing wrong with this, except that someone reading your code may wonder what the significance of the 1 is when you wrote:

    ...
    hash_of_invariants[invariant] = 1
    ...

The comments certainly help. But Python has a built-in data structure for this case called set. If we made a set of invariants instead, we would simply have:

    ...
    set_of_invariants.add(invariant)
    ...

There is not a visible dummy value that might confuse future maintainers of your code. Under the hood, the Python set is implemented as a dictionary (hash table) with hidden dummy values with a few optimizations, so the real reason for doing this is readability. Because hash tables are used more frequently than sets, there is a valid argument that the hash table is more readable for more programmers.

The final solution

Here is our final, compact solution! Any interviewer would be happy to see this solution during a technical interview.

# METHOD D: using sets
def groupsOfAnagrams(words):
    invariants = ["".join(sorted(w)) for w in words]
    # make a set out of invariants
    # (i.e. make a dummy hash table with keys being the entry in
    #  invariants, and a hidden dummy variable)
    unique_invariants = set(invariants)
    return len(unique_invariants)

Timing all the methods

In an interview, you always have to consider the runtime. As you could see, a lot of the effort in creating a solution for this interview question was finding an algorithm that was quick enough! I timed the algorithms for two cases:

  • Our first example, words = [listen, admirer, tea, eta, eat, ate, silent, tinsel, enlist, codesignal, married]

  • A pathological example, where each word was its own group of 10 letters, starting with abcdefghij, abcdefghik, …, to abgilnoz (this is a total of 10^5 words)

I didn’t include the raw times, as that tells you more about my computer than the algorithm. Instead I normalized to the time for Method D on the list words (i.e. Method D has a time of 1 by definition).

Method Description Time for words (compared to D on words) Time for 10^5 words (compared to D on words)
A O(N^2) method 5.4 Too Long to Time
B Using precalculated tuple invariant for hash table 6.4 70300
C Using sorted words for hash tables 1.2 10450
D Using sorted words for sets 1.0 7000

Tell us…

Did your solution for this Interview Practice challenge look different than mine? How did you approach the problem? Let us know over on the CodeSignal forum!

CodeFights Solves It: hostnamesOrdering

hostnamesOrdering solution

The Challenge of the Week that ended on March 21 was hostnamesOrdering. Only 72 CodeFighters submitted solutions – this was a tricky one! In this breakdown, I’m going to walk you through my thought process in writing a solution.

In this database challenge, the task is to take a table of numeric ids and hostnames, and return a sorted version of the table. The challenge is that we have to order the hostnames by the reverse hostnameThe hostname is a string of domains separated by periods, such as codefights.com or www.github.com. The domains don’t include any protocols (such as http://), nor do they include any paths (such as codefights.com/forums). The general form of a hostname is domain1.domain2. … .domainN. The reversed hostname is domainN. … .domain2.domain1. For example:

[table id=6 /]

Note that the list above is ordered by the reverse hostname. We are told that there are at most 3 domains in any given hostname that appears in this challenge.

Getting started

Let’s start with an example of the table hostnames:

[table id=7 /]

Let’s suppose for a moment that we have a way of generating the reversed hostnames, which will actually be the hardest part of the challenge.

[table id=8 /]

The query we want to return is:

# Assuming we can make a ‘reversed’ column
SELECT id, hostname FROM hostnames ORDER BY reversed;

which would return:

[table id=9 /]

Reversing the domain name

We won’t actually construct an explicit reversed column. Instead, we will generate the pieces of the domain we want on the fly. To do this, we will be using the SUBSTRING_INDEX function. From the documentation, we see that SUBSTRING_INDEX( string, delim, count) returns a substring.

If count is positive, SUBSTRING_INDEX( string, delim, count) returns the substring from the beginning to the countth occurrence of delim. For example:

mysql> SELECT SUBSTRING_INDEX('www.codefights.com', '.',1);
          -> www
mysql> SELECT SUBSTRING_INDEX('www.codefights.com', '.', 2);
          -> www.codefights
mysql> SELECT SUBSTRING_INDEX('www.codefights.com', '.', 45);
          -> www.codefights.com

If count is negative, SUBSTRING_INDEX finds the |count|th occurrence of delim from the right, and returns the substring from there to the end of the string. For example:

mysql> SELECT SUBSTRING_INDEX('www.codefights.com', '.', -1);
       -> com
mysql> SELECT SUBSTRING_INDEX('www.codefights.com', '.', -2);
       -> codefights.com
mysql> SELECT SUBSTRING_INDEX('www.codefights.com', '.', -100);
       -> www.codefights.com

Using the same example we’ve been working with, here is a query that splits the hostname into three pieces. Using SUBSTRING_INDEX, we can extract the top level domain, the midlevel domain, and the lowest level of domain just by changing the index value.

SELECT id, hostname, SUBSTRING_INDEX(hostname,'.',-1) as top,
                                      SUBSTRING_INDEX(hostname,'.',-2) as mid
               FROM hostnames ORDER BY top, mid, hostname;

This would work if every domain has exactly three pieces. Unfortunately, some domains only have one or two pieces. What is actually returned is:

[table id=10 /]

The first two rows are in the wrong order. They match at the top level (i.e., they are both .com domains). The com domain has no lower level, but the SUBSTRING_INDEX just returns the entire string com again. Since codefights.com is earlier in the alphabet than com, the table puts codefights.com first. What should happen is that codefights should be compared to the empty string, and com should appear first.

One trick to fix this problem is to prepend the hostnames with ‘…’, guaranteeing that there are three periods in each domain. So our query is now:

SELECT id, hostname, SUBSTRING_INDEX(CONCAT('...', hostname),'.',-1) as top,
                                      SUBSTRING_INDEX(CONCAT('...',hostname),'.',-2) as mid
               FROM hostnames ORDER BY top, mid, hostname;

Writing down just the first four rows, this query returns:

[table id=11 /]

This has everything in the right order!

A problem and a workaround

In principle, we just have to eliminate the top and middle column, so we are only reporting the id and hostname. Before making this change, we can try running this in CodeFights. When we do it, we run into a problem: the CodeFights MySQL engine returns workbench.mysql.com before dev.mysql.com!

This seems to be a bug; running MySQL on a local machine returns dev.mysql.com before workbench.mysql.com. However, ordering by the hostname on CodeFights returns the opposite. When using ORDER BY top, mid, hostname, workbench.mysql.com appears first. This seems wrong, since both these hostnames have identical top and mids, so the ordering should be determined by hostname.

Part of being a programmer is working around issues in any software! The problem we are encountering comes from doing multiple orderings. Our workaround will be to concatenate the top, middle, and hostname into one string. We will separate the top, middle, and hostname by a space, because spaces cannot occur in a domain and spaces appear earlier in ASCII than any letter.

So our workaround is:

SELECT id, hostname, SUBSTRING_INDEX(CONCAT('...', hostname),'.',-1) as top,
                                      SUBSTRING_INDEX(CONCAT('...',hostname),'.',-2) as mid
                FROM hostnames ORDER BY CONCAT(top, ' ', mid, ' ', hostname);

This orders dev.mysql.com and workbench.mysql.com correctly.

Eliminating columns

Now we need to eliminate the top and mid columns. We could do a subquery, but instead we will actually construct the top and the middle strings as part of the ORDER BY clause.

Our solution ends up looking like this:

CREATE PROCEDURE hostnamesOrdering()
    SELECT id, hostname FROM hostnames
          ORDER BY CONCAT(SUBSTRING_INDEX(CONCAT('...',hostname),'.',-1),
                                               ' ',
                                               SUBSTRING_INDEX(CONCAT('...',hostname),'.',-2),
                                               ' ',
                                               hostname);

This solution only selects the desired columns. Notice that in the ORDER BY we have a single string, made from the “top” level domain, a space, the “middle level” domain, a space, and then the hostname. This gets around the implementation bug, and gets us a solution with 188 characters.

On a local MySQL database, we don’t need the final concatenation, and we could use multiple ordering to get:

CREATE PROCEDURE hostnamesOrdering()
    SELECT id, hostname FROM hostnames
          ORDER BY SUBSTRING_INDEX(CONCAT('...',hostname),'.',-1),
                              SUBSTRING_INDEX(CONCAT('...',hostname),'.',-2),
                              hostname;

Going further

As a code golfing exercise, we have some compact solutions. As a coding exercise, you might be interested in trying to solve the general case where we don’t have a limit on the maximum number of domains in a hostname!

This takes us back to our original solution:

SELECT id, hostname FROM hostnames ORDER BY reverse_hostname(hostnames);

where the interesting exercise is to create a custom function reverse_hostname. If you get stuck, this StackOverflow post might be useful.

Tell us…

How did you tackle this challenge? What do you think about the approach I took in solving it? Let us hear from you in the CodeFights forum!

CodeSignal Solves It: hostnamesOrdering

hostnamesOrdering solution

The Challenge of the Week that ended on March 21 was hostnamesOrdering. Only 72 CodeFighters submitted solutions – this was a tricky one! In this breakdown, I’m going to walk you through my thought process in writing a solution.

In this database challenge, the task is to take a table of numeric ids and hostnames, and return a sorted version of the table. The challenge is that we have to order the hostnames by the reverse hostnameThe hostname is a string of domains separated by periods, such as codesignal.com or www.github.com. The domains don’t include any protocols (such as http://), nor do they include any paths (such as codesignal.com/forums). The general form of a hostname is domain1.domain2. … .domainN. The reversed hostname is domainN. … .domain2.domain1. For example:

[table id=6 /]

Note that the list above is ordered by the reverse hostname. We are told that there are at most 3 domains in any given hostname that appears in this challenge.

Getting started

Let’s start with an example of the table hostnames:

[table id=7 /]

Let’s suppose for a moment that we have a way of generating the reversed hostnames, which will actually be the hardest part of the challenge.

[table id=8 /]

The query we want to return is:

# Assuming we can make a ‘reversed’ column
SELECT id, hostname FROM hostnames ORDER BY reversed;

which would return:

[table id=9 /]

Reversing the domain name

We won’t actually construct an explicit reversed column. Instead, we will generate the pieces of the domain we want on the fly. To do this, we will be using the SUBSTRING_INDEX function. From the documentation, we see that SUBSTRING_INDEX( string, delim, count) returns a substring.

If count is positive, SUBSTRING_INDEX( string, delim, count) returns the substring from the beginning to the countth occurrence of delim. For example:

mysql> SELECT SUBSTRING_INDEX('www.codesignal.com', '.',1);
          -> www
mysql> SELECT SUBSTRING_INDEX('www.codesignal.com', '.', 2);
          -> www.codesignal
mysql> SELECT SUBSTRING_INDEX('www.codesignal.com', '.', 45);
          -> www.codesignal.com

If count is negative, SUBSTRING_INDEX finds the |count|th occurrence of delim from the right, and returns the substring from there to the end of the string. For example:

mysql> SELECT SUBSTRING_INDEX('www.codesignal.com', '.', -1);
       -> com
mysql> SELECT SUBSTRING_INDEX('www.codesignal.com', '.', -2);
       -> codesignal.com
mysql> SELECT SUBSTRING_INDEX('www.codesignal.com', '.', -100);
       -> www.codesignal.com

Using the same example we’ve been working with, here is a query that splits the hostname into three pieces. Using SUBSTRING_INDEX, we can extract the top level domain, the midlevel domain, and the lowest level of domain just by changing the index value.

SELECT id, hostname, SUBSTRING_INDEX(hostname,'.',-1) as top,
                                      SUBSTRING_INDEX(hostname,'.',-2) as mid
               FROM hostnames ORDER BY top, mid, hostname;

This would work if every domain has exactly three pieces. Unfortunately, some domains only have one or two pieces. What is actually returned is:

[table id=10 /]

The first two rows are in the wrong order. They match at the top level (i.e., they are both .com domains). The com domain has no lower level, but the SUBSTRING_INDEX just returns the entire string com again. Since codesignal.com is earlier in the alphabet than com, the table puts codesignal.com first. What should happen is that codesignal should be compared to the empty string, and com should appear first.

One trick to fix this problem is to prepend the hostnames with ‘…’, guaranteeing that there are three periods in each domain. So our query is now:

SELECT id, hostname, SUBSTRING_INDEX(CONCAT('...', hostname),'.',-1) as top,
                                      SUBSTRING_INDEX(CONCAT('...',hostname),'.',-2) as mid
               FROM hostnames ORDER BY top, mid, hostname;

Writing down just the first four rows, this query returns:

[table id=11 /]

This has everything in the right order!

A problem and a workaround

In principle, we just have to eliminate the top and middle column, so we are only reporting the id and hostname. Before making this change, we can try running this in CodeSignal. When we do it, we run into a problem: the CodeSignal MySQL engine returns workbench.mysql.com before dev.mysql.com!

This seems to be a bug; running MySQL on a local machine returns dev.mysql.com before workbench.mysql.com. However, ordering by the hostname on CodeSignal returns the opposite. When using ORDER BY top, mid, hostname, workbench.mysql.com appears first. This seems wrong, since both these hostnames have identical top and mids, so the ordering should be determined by hostname.

Part of being a programmer is working around issues in any software! The problem we are encountering comes from doing multiple orderings. Our workaround will be to concatenate the top, middle, and hostname into one string. We will separate the top, middle, and hostname by a space, because spaces cannot occur in a domain and spaces appear earlier in ASCII than any letter.

So our workaround is:

SELECT id, hostname, SUBSTRING_INDEX(CONCAT('...', hostname),'.',-1) as top,
                                      SUBSTRING_INDEX(CONCAT('...',hostname),'.',-2) as mid
                FROM hostnames ORDER BY CONCAT(top, ' ', mid, ' ', hostname);

This orders dev.mysql.com and workbench.mysql.com correctly.

Eliminating columns

Now we need to eliminate the top and mid columns. We could do a subquery, but instead we will actually construct the top and the middle strings as part of the ORDER BY clause.

Our solution ends up looking like this:

CREATE PROCEDURE hostnamesOrdering()
    SELECT id, hostname FROM hostnames
          ORDER BY CONCAT(SUBSTRING_INDEX(CONCAT('...',hostname),'.',-1),
                                               ' ',
                                               SUBSTRING_INDEX(CONCAT('...',hostname),'.',-2),
                                               ' ',
                                               hostname);

This solution only selects the desired columns. Notice that in the ORDER BY we have a single string, made from the “top” level domain, a space, the “middle level” domain, a space, and then the hostname. This gets around the implementation bug, and gets us a solution with 188 characters.

On a local MySQL database, we don’t need the final concatenation, and we could use multiple ordering to get:

CREATE PROCEDURE hostnamesOrdering()
    SELECT id, hostname FROM hostnames
          ORDER BY SUBSTRING_INDEX(CONCAT('...',hostname),'.',-1),
                              SUBSTRING_INDEX(CONCAT('...',hostname),'.',-2),
                              hostname;

Going further

As a code golfing exercise, we have some compact solutions. As a coding exercise, you might be interested in trying to solve the general case where we don’t have a limit on the maximum number of domains in a hostname!

This takes us back to our original solution:

SELECT id, hostname FROM hostnames ORDER BY reverse_hostname(hostnames);

where the interesting exercise is to create a custom function reverse_hostname. If you get stuck, this StackOverflow post might be useful.

Tell us…

How did you tackle this challenge? What do you think about the approach I took in solving it? Let us hear from you in the CodeSignal forum!

CodeFights Solves It: pastEvents

pastEvent solution

Preparing for interviews? Looking for a new job?

Practice for interviews using real questions on need-to-know topics.
Get matched with our hiring partners when you're ready for a new job!

Watch video

See how Marcus used CodeFights and landed a job at Evernote!

CodeSignal Solves It: pastEvents

A little while back, our Challenge  of the Week was a tricky database challenge called pastEvents. 281 of our awesome CodeFighters solved it! I’ve created a walkthrough of the problem and a few different tactics for finding a solution. Here’s the problem, in case you need a refresher:

During the most recent social event you suddenly realized that you had forgotten your USB drive on one of the previous events. You are pretty sure that you had your flash drive with you just last week, which means that it was probably lost during one of the events of the last 7 days. So, you would like to find all the events you attended during this period.

And the output should be:

A table of all names and dates of events within 7 days of the most recent event, not including the most recent event, order from most to least recent.

The problem gives us a database with the table events, as well as a sample dataset:

[table id=1 /]

How would we construct the solution by hand?

  1. We would look for the most recent date, in this case the event with id 4 is the most recent with an event_date of 2016-12-02.
  2. We would then SELECT all dates that occur within 7 days before 2016-12-02. This would select the events with ids 2 and 3.
  3. Finally we would sort the returned events based on event_date with ORDER BY.

With this strategy in mind, let’s start putting the pieces together!

Step 1: Find the most recent event

For this, we need to find the maximum value of the event_date column:

SELECT max FROM events

The time taken for this query will scale linearly with the number of events we have stored, which is about the best we can hope for. After all, to find the maximum we have to look at the date of every single record!

Here is a different approach that also selects the most recent date:

SELECT event_date FROM events ORDER BY event_date DESC LIMIT 1

This approach is to take all the dates, order them from latest to earliest, and then select the first one off the list. This approach isn’t as good, because it isn’t as clear what the code is trying to do. If you are reviewing this code, you have to get to the very end of the line to realize that we’re only picking one value. You have to carefully think if you want to order the list in ASCending or DESCending order. A naive implementation of this query would also be slower than our first attempt, because first the entire list would need to be sorted before the most recent event was selected (MySQL is almost certainly smart enough to optimize this away). Finally, if our goal is to get a short solution, this line is less clear and more verbose than our attempt based off the max function.

Now we need to write a query that uses the latest date and selects events in the 7 days prior to our event. We can either use our query above as a subquery, or store the results as a variable and use that. We will start by using the variable @recent, as it leads to more readable code, and then optimize once we’re done.

Our solution so far looks like this:

CREATE PROCEDURE pastEvents()
BEGIN
    SET @recent = (SELECT max(event_date) FROM Events);
       /* can also use
            SELECT MAX(event_date) INTO @recent FROM Events;
        */
END

Step 2: Select events within 7 days of @recent

Given an event_date, we need to determine if it was within 7 days of @recent. MySQL provides 60 functions for dealing with date calculations. Some of the options relevant to us are:

[table id=3 /]

All of these include the most recent event. We can throw out the most recent event by adding another DATEDIFF, or by using BETWEEN to ensure that our DATEDIFF is also bigger than 1. Because we are assured that there is at most one event per day, eliminating the most recent one can be achieved with event_date != @recent.

We now add a SELECT with a WHERE clause to get the events within 7 days, so our code now takes the form:

CREATE PROCEDURE pastEvents()
BEGIN
    SET @recent = (SELECT MAX(event_date) FROM Events);
       
       SELECT name,event_date FROM Events WHERE DATEDIFF(@recent, event_date) <= 7 AND event_date != @recent;
END

There is a really sneaky way of eliminating the extra clause to see if the event is the most recent event. We are interested in selecting events where

 

Instead of using subtraction, what do we know about the reciprocal of date differences? We know that

 

 

We get the lower limit by @recent being 7 days ahead of event_date. If the difference between the dates is 8 days, then 1/(@recent-event_date) is . The cool part is the upper limit is achieved if @recent and event_date are one day apart. If we look at the case where @recent and event_date are the same day, then 1/(@recent-event_date) is undefined, so MySQL will return false for any comparison. This means we don’t have to worry about the upper limit, and instead we only care if

 

 

Writing this another way:

 

 

Our shorter (but conceptually harder to understand) solution is

CREATE PROCEDURE pastEvents()
BEGIN
       SET @recent = (SELECT MAX(event_date) FROM Events);
       SELECT name,event_date FROM Events WHERE 7/DATEDIFF(@recent, event_date) >= 1;
END

Step 3: Order the output

Once we’ve gotten this far, it’s simply a case of adding an ORDER BY clause to the end. Finally we have something that passes the test cases and the hidden tests! Let’s call it a solution with 171 characters:

CREATE PROCEDURE pastEvents()
BEGIN
       SET @recent = (SELECT MAX(event_date) FROM Events);  
       SELECT name,event_date FROM Events 
              WHERE 7/DATEDIFF(@recent, event_date) >= 1
              ORDER BY event_date DESC;
END

Refinement 1: Subclause

It’s clear that we could get shorter code by naming @recent something like @r. We can also eliminate some characters by running a subclause to get the maximum date:

CREATE PROCEDURE pastEvents()
BEGIN  
       SELECT name,event_date FROM Events,
              (SELECT MAX(event_date) as r FROM Events) recent
              WHERE 7/DATEDIFF(r, event_date) >= 1
              ORDER BY event_date DESC;
END

This code JOINS every event with the single value from the subclause SELECT MAX(event_date) as r FROM Events, which is just the variable we stored in @recent in our previous example. We want to be careful with JOINS, because the number of rows in the returned table is the product of the rows in the two input tables. In this case, the derived table recent only has one row, so we only have to sort through the same number of events as we started with.

Refinement 2: Change >= to >; eliminate BEGIN and END

We can also eliminate a character by noting that DATEDIFF has to return an integer, so requiring 7/DATEDIFF(r,event_date) >= 1 gives the same dates as 8/DATEDIFF(r,event_date) > 1. MySQL doesn’t require BEGIN and END for procedures, so we can eliminate those too. This gets us to:

CREATE PROCEDURE pastEvents()
       SELECT name,event_date FROM Events,
              (SELECT MAX(event_date) as r FROM Events) recent
              WHERE 8/DATEDIFF(r, event_date) > 1
              ORDER BY event_date DESC;

The shortest solution

The very short solutions to this challenge took a very different approach from the one I just demonstrated. Here’s the code submitted by CodeFighter gennadi_m:

CREATE PROCEDURE pastEvents()
    SELECT e.name, e.event_date FROM Events e, Events v
           GROUP BY 2 DESC
           HAVING 8/datediff(max(v.event_date), event_date) > 1

The first step is taking the Event table, and taking the cartesian product (i.e. OUTER JOIN) with itself. In our case, this gives us a 25 row table:

[table id=4 /]

The columns are grouped by “column 2” in the output, e.event_date. The group by is only important when we are using an aggregate function, which is this case is max. The first group on this table is the group e.event_date = 2016-11-07, for example. Out of this group, we ask for the maximum v.event_date, which is 2016-12-02. We can only use aggregate properties, so our table becomes:

[table id=5 /]

The combination of using a self-join, grouping on event date,  and max aggregation is just a complicated way of adding the most recent date to every row.

While the query is a lot smaller, joining a table with itself squares the number of entries in it. MySQL might be able to optimize your query in some cases, but you want to be careful before trying this trick in production code!

Tell us…

How did you tackle this challenge? What do you think about the approach I took in solving it, and the refinements I added? Let us hear from you in the CodeSignal forum!