LinkedIn Interview · A Large Properly Formatted Data File

05 Oct 2017, 14:23

I interviewed today with LinkedIn and [REDACTED]. He was intimidatingly knowledgeable and described a work environment that I really hope to join. I hope that I have the skill set necessary.

What follows are notes of the responses for my edification, not necessarily how I answered.

Describe what happens when we type ssh shell.linkedin.com and hit enter.

When you hit enter, your shell receives the string you typed on the command line. In this case, the shell is bash.
Bash splits the string on whitespace and takes the first word, argv[0], as the name of the command you intend to run.
It will look up that name in several places, in this order:
- aliases
- functions
- builtins
- $PATH
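The last step of that lookup, the $PATH search, can be sketched in Python. This is only an illustration that assumes a POSIX-like system; `find_on_path` is a hypothetical helper name, and it ignores the earlier lookup steps because those exist only inside the shell.

```python
import os

def find_on_path(name, path=None):
    """Return the first executable called `name` found in the PATH directories, or None."""
    if path is None:
        path = os.environ.get("PATH", os.defpath)
    for directory in path.split(os.pathsep):
        # An empty PATH entry means the current directory.
        candidate = os.path.join(directory or ".", name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```

Calling `find_on_path("ssh")` returns the full path of the first match, or `None` if no directory in $PATH contains an executable with that name.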
As an aside: Why must cd be a “builtin”?
- If the shell spawned cd as a separate program the way it does any other command, the child process would make the syscall to change its own working directory and then exit. The parent shell's working directory would be left unchanged.
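A quick way to see this for yourself, sketched in Python: spawn a child process that changes its own directory, then check the parent's. The function name is made up for this example, and it assumes a Python interpreter is available at `sys.executable`.

```python
import os
import subprocess
import sys

def child_chdir_leaves_parent_alone(target="/"):
    """Run a child that changes its own directory; return the parent's cwd before and after."""
    before = os.getcwd()
    # The child calls chdir on itself and exits, like an external `cd` would.
    subprocess.run(
        [sys.executable, "-c", "import os, sys; os.chdir(sys.argv[1])", target],
        check=True,
    )
    after = os.getcwd()
    return before, after
```

The two returned paths are the same, because the working directory belongs to each process and the child's change dies with it.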
Once bash has worked out what the command name refers to, it forks, and the child process calls exec on the executable you intend to run.
Now ssh gets that list of arguments and can use something like getopt to parse meaning out of it.
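The fork-then-exec step looks roughly like this in Python on a POSIX system. It is a minimal sketch of the pattern, not how bash is actually implemented; `run` is a hypothetical name, and the exit code 127 for a missing command follows the shell convention.

```python
import os

def run(argv):
    """Fork, exec argv in the child, and return the child's exit status to the parent."""
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image with the requested program.
        try:
            os.execvp(argv[0], argv)
        except OSError:
            # exec failed (for example, command not found); exit without cleanup.
            os._exit(127)
    # Parent: block until the child finishes, as an interactive shell does.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)
```

`os.execvp` searches $PATH for `argv[0]`, so `run(["ssh", "shell.linkedin.com"])` would follow the same path as the command in the question.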
Assuming that the argument parsing went well, we can now try to resolve the hostname that ssh has been passed.
This involves looking in /etc/hosts for a matching hostname-to-IP entry or, failing that, reading /etc/resolv.conf to find a DNS server that will perform a recursive lookup on our behalf.
This is encapsulated in a call to the C library's resolver (such as getaddrinfo) rather than in a single syscall to the kernel. DNS queries normally go over UDP on port 53, falling back to TCP for large responses.
Once you have the IP, you can then establish a TCP connection to the server on port 22 (the default SSH port).
After the TCP connection is established, the SSH protocol itself takes over; see the SSH Review for that part.
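The resolve-then-connect steps can be sketched with Python's socket module, which hands the lookup to the system resolver. The helper names `resolve` and `connect` are my own; nothing here speaks the SSH protocol, it only gets as far as the TCP connection.

```python
import socket

def resolve(host, port=22):
    """Return the unique IP addresses the system resolver gives for host."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    seen = []
    for info in infos:
        # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
        ip = info[4][0]
        if ip not in seen:
            seen.append(ip)
    return seen

def connect(host, port=22, timeout=5.0):
    """Open a TCP connection to host:port, trying each resolved address in turn."""
    return socket.create_connection((host, port), timeout=timeout)
```

`resolve("shell.linkedin.com")` would return whatever addresses the resolver finds, and `connect` returns a connected socket or raises `OSError` if every address fails.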

You have a system where 10000 clients need to access 10GB of information from a server. The data changes a few sectors a day.

rsync is the key to this question. With rsync, you can establish a connection to a remote server and synchronize files over the connection, sending only what needs to be sent.
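To make "sending only what needs to be sent" concrete, here is a much simplified Python sketch. It compares fixed-size blocks by hash, which is not rsync's real algorithm (rsync uses a rolling checksum so it can also cope with inserted or shifted data). The block size and all function names are assumptions for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, chosen only for illustration

def block_hashes(data, block_size=BLOCK_SIZE):
    """Hash each fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]

def changed_blocks(old_hashes, new_data, block_size=BLOCK_SIZE):
    """Return {block_index: bytes} for blocks of new_data that differ from old_hashes."""
    delta = {}
    for index, digest in enumerate(block_hashes(new_data, block_size)):
        if index >= len(old_hashes) or old_hashes[index] != digest:
            start = index * block_size
            delta[index] = new_data[start:start + block_size]
    return delta

def apply_delta(old_data, delta, new_length, block_size=BLOCK_SIZE):
    """Rebuild the new data on the client from its old copy plus the changed blocks."""
    # Start from the old copy, truncated or zero-padded to the new length.
    out = bytearray(old_data[:new_length].ljust(new_length, b"\0"))
    for index, block in delta.items():
        start = index * block_size
        out[start:start + len(block)] = block
    return bytes(out)
```

The client sends its block hashes, the server answers with only the blocks that differ, and the client patches its copy. For 10GB of data where a few sectors change per day, the delta is tiny compared with the full file.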
Karrick shared with me a utility he discovered in the interview process, ssync, which solves the distributed portion of this challenge too.
Without ssync, you will run into the issue that when a change happens on the server, all 10000 clients are notified and try to synchronize at the same time. This will cause a bottleneck at the server.
To mitigate this, you can set up a self-similar hierarchy in which the server broadcasts to 10 nodes and those nodes synchronize files with it. Each of those nodes in turn has 10 nodes of its own. Once the files are synchronized at any level, the next level gets a notification.
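The arithmetic behind the hierarchy is easy to check. This small Python sketch counts how many levels are needed below the server, assuming every node in the tree is itself one of the clients; `levels_needed` is a name I made up.

```python
def levels_needed(clients, fanout):
    """Smallest tree depth below the server whose levels together hold `clients` nodes."""
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    depth, capacity, level_size = 0, 0, 1
    while capacity < clients:
        depth += 1
        level_size *= fanout    # nodes at this level
        capacity += level_size  # nodes at this level and all levels above it
    return depth
```

For 10000 clients and a fan-out of 10 this returns 4 (10 + 100 + 1000 + 10000 = 11110 slots), so no machine ever serves more than 10 peers and an update reaches everyone in four hops.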

You just inherited a software system that has no metrics. Describe what you do in the first quarter to allow you to sleep at night.

Firstly, I would get a trial of SignalFX, because it is the only platform of this kind that I know.
SignalFX has a fork of collectd that it uses to gather information from nodes.
Giving SignalFX your AWS credentials will allow it to identify all the resources that you run: EC2 instances, RDS databases, and so on.
Once that data is in SignalFX, you can begin to create dashboards and set up alerts.
Those alerts can be piped to PagerDuty or, on Karrick’s recommendation, Iris. He wrote Iris, and it seems like an awesome open-source solution to the problem that PagerDuty solves.
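The gather-then-alert loop can be shown in miniature with a generic Python sketch. It does not use the SignalFX, collectd, PagerDuty, or Iris APIs; the metric names, thresholds, and function names are all hypothetical, and `os.getloadavg` assumes a Unix-like host.

```python
import os
import shutil

def collect_metrics(path="/"):
    """Gather a couple of basic host metrics (hypothetical metric names)."""
    usage = shutil.disk_usage(path)
    load_1m, _, _ = os.getloadavg()
    return {
        "disk_used_pct": 100.0 * usage.used / usage.total,
        "load_1m": load_1m,
    }

def check_alerts(metrics, thresholds):
    """Return the names of metrics whose value exceeds its threshold."""
    return [
        name
        for name, limit in thresholds.items()
        if name in metrics and metrics[name] > limit
    ]
```

A monitoring platform does this continuously across every host, and anything `check_alerts` returns is what would be routed to the on-call system.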