Following on from my blog post on the basics of SIP call setup, I wanted to build on that to cover how media is negotiated between clients. Again, this topic has been blogged about by others; the purpose of this post is to put my understanding into my own words, and if it helps you, then great!
When a user initiates a call to another endpoint, a SIP INVITE is sent from the calling party to their SIP Proxy, which forwards that request. I covered the SIP portion of the message in my previous article here, but SIP by itself will not fully establish the call; it simply connects two or more endpoints together for communication. For voice to pass between endpoints a media stream is required, whether this is direct between clients, known as P2P (Peer to Peer), or via a media relay server such as the Skype for Business Edge Server or Mediation Server.
When voice is established in P2P, both endpoints in the conversation (e.g. client laptops) are responsible for all the processing required to sample analogue voice and digitize it into a binary value, and vice versa. The media is connected directly between the clients and requires no assistance from Skype for Business Mediation, Edge or Front End Servers. There is one caveat to this: if P2P cannot be established between two internal clients directly over the network, then the Edge server is used as a media relay server between the clients.
Where a media relay server is used (or several in the entire path), each endpoint negotiates media directly with its local media relay server, and each media relay server negotiates media with the next media relay server in the call path, and so on. Therefore, a voice stream can be encoded and decoded into several formats along the call path. Theoretically you could be encoding SILK between client and media relay server, then G.711 from media relay server to another media relay server, and from that G.729 to the destination client. However, all this encoding and decoding costs resources and time. Skype for Business compensates for this by supporting multiple codecs at client and server, which allows the client to negotiate the best codec for the type of call. For example, when calling a PSTN number, the Skype for Business client will almost certainly choose to use G.711 between client and media relay server (mediation). This is because G.711 is the most popular PSTN codec, and selecting it means that the mediation server does not need to decode and encode into another format, but can simply proxy the media between client and Session Border Controller. The result is less hardware and more calls per server because the resource requirement is significantly lower. It also means that your media gets to the destination faster for a more real-time audio experience.
It is worth noting here that a SIP endpoint (the IP address of a client or SIP Proxy Server) is not necessarily an address that can be used for media. The media endpoint (IP address) could be, and often is, different to that of the SIP endpoint, especially when media relay servers are used. Think about how your Edge server operates. It has 3 public IP addresses: one for SIP signaling, another for web conference media and another for voice / video media. If we sent media to the SIP endpoint then no one would get voice, because the Edge server would not be expecting voice traffic at that IP. It expects it on the AV IP instead.
So how does each client / server in the call path know what codec and what media destination to use?
This is done using a protocol called SDP (Session Description Protocol). As the name suggests, it’s a protocol that “describes” the type of conversation in the session (audio or video or both). SDP cannot carry or transport itself over the network, and therefore “piggy backs” on the SIP protocol as MIME content. SIP uses UDP or TCP as its transport layer protocol, so SDP can be exchanged by endpoints inside SIP messages.
The MIME content in the message body is described by the CONTENT-TYPE SIP header. When SDP is incorporated within the SIP message body the Content-Type is usually “application/sdp”. However, Skype behaves differently when communicating over an Edge server, and you will see “multipart/alternative;boundary” as the Content-Type in your first SIP INVITE message. The CONTENT-LENGTH is the length in bytes of the SIP body. In the below example we can see that the body of our SIP INVITE message is 5.9KB in size.
To describe what multipart/alternative means, we first look at the word “multipart”. As the word suggests, it says that the message body contains multiple data parts, or more than one set of data. The next word, “alternative”, means that the data parts within the body are alternative versions serving the same purpose. The “boundary” element is the marker that shows where each data part starts and ends. Here we can see that we are telling the called party’s media relay server that the data parts are delimited by the boundary ending in “E4DF0”, and that if the first data part cannot be used, there is an alternative data part in the message body. So in plain English, this means there are two data parts separated by the same boundary. If the first one fails, use the next data part marked with the same boundary name.
Now let’s look at the message body.
Here we see the first part of the SDP body. The first item I want to call out is CONTENT-DISPOSITION. This tells us what the contents of the SDP body are to be used for. We can see here that there is an attribute called MS-PROXY-2007fallback. This attribute states that this SDP body is for fallback support for OCS 2007 and earlier systems. We see this in Skype to Skype calls over federation. It is included in case the recipient’s UC platform is OCS 2007 and we need to communicate with it using legacy methods, i.e. the 50,000 TCP/UDP port range.
Jumping to the other data part within the boundary, we can see this SDP body is for Skype for Business to Skype for Business / Lync 2013 / Lync 2010 communication. Fundamentally both data parts do the same thing, but the fallback SDP contains a different method of discovering candidates (explained later). Incidentally, Skype for Business is intelligent enough not to use the legacy method in a Skype for Business to Skype for Business conversation over federation, so just because fallback support is above the preferred SDP body doesn’t mean Skype will use it if it doesn’t need to. OCS, on the other hand, processes SDP body content in order of appearance (I believe), which is why the fallback is listed first.
So what do all these V’s, O’s, S’s etc. mean in SDP? Let’s take a walk through them all.
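To make the two data parts a little more concrete, here is a minimal Python sketch that splits a multipart/alternative body and flags which part is the fallback. The boundary name, the exact Content-Disposition parameters and the tiny SDP bodies are made up for illustration; a real INVITE body will be far longer.

```python
from email import message_from_string

# Illustrative body only: the boundary name and SDP contents are placeholders.
RAW = (
    'Content-Type: multipart/alternative; boundary="EXAMPLE-E4DF0"\n'
    "\n"
    "--EXAMPLE-E4DF0\n"
    "Content-Type: application/sdp\n"
    "Content-Disposition: session; handling=optional; ms-proxy-2007fallback\n"
    "\n"
    "v=0\ns=session\n"
    "--EXAMPLE-E4DF0\n"
    "Content-Type: application/sdp\n"
    "Content-Disposition: session; handling=optional\n"
    "\n"
    "v=0\ns=session\n"
    "--EXAMPLE-E4DF0--\n"
)


def split_sdp_parts(raw: str) -> list:
    """Return (is_fallback, sdp_text) for each application/sdp part, in order."""
    message = message_from_string(raw)
    parts = []
    if not message.is_multipart():
        return parts
    for part in message.get_payload():
        if part.get_content_type() != "application/sdp":
            continue
        disposition = part.get("Content-Disposition", "").lower()
        is_fallback = "ms-proxy-2007fallback" in disposition
        parts.append((is_fallback, part.get_payload()))
    return parts


for is_fallback, sdp in split_sdp_parts(RAW):
    label = "fallback" if is_fallback else "preferred"
    print(label, "SDP part,", len(sdp), "characters")
```

The standard library MIME parser is used here only because the SIP body follows the same multipart rules as email; it is not how the Skype for Business client itself is implemented.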
- V= is the attribute that contains the SDP version number. In this case the version is 0 (zero).
- O= is the attribute that contains information about the Origin of the SDP body. The first part contains the username; in the above example it is not required, so it is displayed as a “-”. The next part contains the Session ID, which is 0 (zero). The next part is the Session Version, which is 1. Then comes the network connection type, displayed as “IN” for INTERNET, the network address type, which is IP version 4 (IP4), and lastly the IP address of the media endpoint. In this case we are connected via External Edge and the IP address of the media relay interface is used.
- S= is the attribute that contains the Session Name, in this case the name is simply Session.
- C= is the attribute that contains the network type (IN = Internet), address type (IP4) and connection address (132.xx.xx.xx) we want to connect to for media
- B= is the attribute that contains the proposed bandwidth that is to be used for media. CT (conference total) means that this is the total figure for all of the media in the conversation. SDP expresses the bandwidth in kilobits per second, so 99980 is roughly 100 Mbps.
- T= is the attribute that contains the start and stop times for a session. The first element is the start time, the second the end. 0 0 means that we are not specifying a time and we are using SIP to decide on start and end of a session.
- A= this declares a custom attribute to extend SDP, you can and will have multiple a= lines in an SDP body. a= lines contain further information about the media types and supported features.
- M= declares the media description. Again you can have multiple lines containing M attributes, commonly referred to as “Multiple M Lines”. In a Skype for Business SDP body you would see 2 M lines for a video call, one for video and one for audio.
Now that you know what these lines do, let’s look at how they help establish that media stream between endpoints.
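As a rough illustration of the fields walked through above, the following Python sketch groups SDP lines by their one-letter type and breaks the O= line into its six parts. The sample SDP is a cut-down reconstruction using the values discussed in this post, and 192.0.2.10 is a documentation placeholder address, not the real media relay IP.

```python
def parse_sdp_lines(sdp: str) -> dict:
    """Group SDP lines by their one-letter type, keeping the order they appear in."""
    fields = {}
    for line in sdp.splitlines():
        line = line.strip()
        if len(line) < 2 or line[1] != "=":
            continue  # skip anything that is not a type=value line
        fields.setdefault(line[0], []).append(line[2:])
    return fields


def parse_origin(value: str) -> dict:
    """Split an o= value into its six space-separated parts."""
    username, session_id, session_version, net_type, addr_type, address = value.split()
    return {
        "username": username,
        "session_id": session_id,
        "session_version": session_version,
        "net_type": net_type,
        "addr_type": addr_type,
        "address": address,
    }


# Placeholder address; the other values mirror the ones described in the list above.
SAMPLE_SDP = """v=0
o=- 0 1 IN IP4 192.0.2.10
s=session
c=IN IP4 192.0.2.10
b=CT:99980
t=0 0
m=audio 55966 RTP/AVP 117 104 101
"""

fields = parse_sdp_lines(SAMPLE_SDP)
origin = parse_origin(fields["o"][0])
bandwidth_type, bandwidth_value = fields["b"][0].split(":")
print(origin)
print(bandwidth_type, int(bandwidth_value))
```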
Firstly, this line. “A=” declares it an extendable attribute and the name of this attribute is X-DEVICECAPS. This attribute provides the media capabilities of the endpoint. In this example the endpoint can send and receive audio and also send and receive video. If the capability check returned just send receive audio and you tried to perform a video call, the video stream would be denied.
The M Line, or media description line, contains information about the type of media we are attempting to negotiate. In this case we want to negotiate Audio. The 55966 value is the port which the endpoint sending this SDP information can use to receive audio. RTP/AVP declares the protocol we want to use for the audio stream: RTP is the Real-time Transport Protocol and AVP is the Audio/Video Profile it runs under. The numbers after this relate to the types of codec the endpoint can support. The order of the numbers is important, because it declares the codec the endpoint prefers to use for this type of call. The order is first = preferred, last = least preferred. So 117 I would like to use, but if all others in between fail, then 101 is my last choice before I say I can’t support the type of media.
Jumping to the codec map attributes now, this is what those numbers in the M Line relate to. So if we look at the first number from the M Line, 117, we can see we have an RTPMAP (RTP map) attribute with an element 117. The codec relating to this number is G.722, advertised with a clock rate of 8KHz (it really samples at 16KHz), and the 2 declares we want to use Stereo (2 channels of audio) instead of Mono. This is the preferred codec the client wants to use for this call.
I won’t spend too much time breaking down each codec, but I will mention 101 telephone-event/8000. This means that this audio stream supports inband DTMF, carried as telephone events inside the RTP stream. This allows DTMF digits to be sent over the same stream as the audio. If inband was not allowed, then either a separate stream for DTMF would be required or DTMF could be sent using SIP INFO/NOTIFY messages for out of band DTMF.
Now that we have declared our preferred codecs, how do we advertise our media endpoint address? The media will need to be connected from the far end and be able to reach back to the calling party. This is done using ICE, or the Interactive Connectivity Establishment protocol. ICE is embedded in SDP.
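The link between the M Line and the rtpmap attributes can be sketched in a few lines of Python. This is only an illustration: the sample lines contain just the three payload types named in this post, whereas a real Skype for Business M Line lists many more.

```python
def parse_media_line(value: str) -> dict:
    """Split the value of an m= line into media type, port, protocol and payload types."""
    media, port, proto, *formats = value.split()
    return {
        "media": media,
        "port": int(port),
        "proto": proto,
        "formats": [int(f) for f in formats],  # order = order of preference
    }


def parse_rtpmaps(lines: list) -> dict:
    """Map each payload type number to (codec name, clock rate, channels)."""
    table = {}
    prefix = "a=rtpmap:"
    for line in lines:
        if not line.startswith(prefix):
            continue
        payload_type, encoding = line[len(prefix):].split(" ", 1)
        name, clock_rate, *channels = encoding.split("/")
        table[int(payload_type)] = (
            name,
            int(clock_rate),
            int(channels[0]) if channels else 1,  # channel count defaults to 1
        )
    return table


# Shortened sample: only the payload types mentioned in this post.
media = parse_media_line("audio 55966 RTP/AVP 117 104 101")
rtpmaps = parse_rtpmaps([
    "a=rtpmap:117 G722/8000/2",
    "a=rtpmap:104 SILK/16000",
    "a=rtpmap:101 telephone-event/8000",
])

# Codec names in the sender's order of preference.
preferred = [rtpmaps[pt][0] for pt in media["formats"] if pt in rtpmaps]
print(preferred)
```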
ICE is encapsulated in A= Line attributes within SDP. From the outset the Skype for Business client will include all its known endpoint addresses. In this case you can see my client advertising its internal IP 192.168.1.225, my home router’s public IP 184.108.40.206 and also my Edge Server’s AV IP 220.127.116.11 as available candidates for media establishment. You will also note that each candidate is a pair: a=candidate:1 1 and a=candidate:1 2 and so on. A pair is needed because the second number is the ICE component ID: component 1 declares the IP and port for the RTP media itself and component 2 declares the IP and port for the accompanying RTCP control traffic. You can also see the network transport protocol used, UDP, and the type of the endpoint this candidate relates to; “host” means client workstation / desk phone etc.
This would work for internal to internal communications, but when connected via an Edge Server, we need a candidate that is discoverable over the internet. Therefore, we are also declaring our Edge Server’s AV IP 18.104.22.168 using port 55966 for audio, but we are declaring the Edge server as a type “relay”. This means it is a media relay server and therefore needs to be told where to relay the media to. This is provided to it by the client by incorporating the related address “raddr” of 22.214.171.124, which is the public IP address of my home router, and the rport, or related port, is the port the router will accept this media stream on.
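A candidate line can be pulled apart with a small amount of Python, which makes the host and relay types easier to compare. This is a sketch under assumptions: the priorities, the ports other than 55966 and the public addresses (198.51.100.7 and 203.0.113.5 are documentation placeholders) are invented for the example.

```python
def parse_candidate(line: str) -> dict:
    """Parse an a=candidate line into its named fields."""
    prefix = "a=candidate:"
    if not line.startswith(prefix):
        raise ValueError("not a candidate line: " + line)
    tokens = line[len(prefix):].split()
    foundation, component, transport, priority, address, port = tokens[:6]
    # Everything after the first six tokens is a run of "key value" pairs.
    extras = dict(zip(tokens[6::2], tokens[7::2]))
    return {
        "foundation": foundation,
        "component": int(component),
        "transport": transport,
        "priority": int(priority),
        "address": address,
        "port": int(port),
        "type": extras.get("typ"),
        "related_address": extras.get("raddr"),
        "related_port": int(extras["rport"]) if "rport" in extras else None,
    }


# Hypothetical lines: priorities, most ports and the public IPs are placeholders.
host = parse_candidate(
    "a=candidate:1 1 UDP 2130706431 192.168.1.225 50020 typ host"
)
relay = parse_candidate(
    "a=candidate:3 1 UDP 16648703 198.51.100.7 55966 typ relay "
    "raddr 203.0.113.5 rport 50030"
)
print(host["type"], host["address"], host["port"])
print(relay["type"], relay["address"], relay["related_address"], relay["related_port"])
```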
This media stream will then get back to the client workstation sending it by using STUN and TURN protocols.
One last attribute worth mentioning is the SilenceSuppression attribute. An endpoint has the option of not sending RTP packets when there is silence on the line (i.e. no one speaking). Disabling this feature means that RTP packets will still be sent and received during silence. This is disabled for stability and to stop devices thinking there is no media and therefore tearing down the call.
Now that the SIP INVITE has been sent to the called party we are waiting for the 200 OK response back. In that 200 OK response will be the called party’s SDP information. Here is the SDP extract from that packet.
Notice the M Line: the first codec of choice is 104, which relates to an rtpmap attribute that equates to SILK/16000 (SILK Wideband) as the preferred codec of the called party’s endpoint. So to recap, the calling party’s endpoint prefers to use 117 G.722 / 8000 and the called party wants to use 104 SILK Wideband.
The calling party’s endpoint acknowledges receipt of the SDP information by sending an ACK to the 200 OK message it received.
As the called party’s endpoint doesn’t have 117 in the list of codecs it supports, the calling party’s client will offer its next preferred codec, which is also 104 SILK Wideband. “Hey Presto!” we have a match. Next the client needs to decide which is the best candidate to use to connect the media stream to. The ICE candidates are checked in priority order, and the first candidate to respond with a valid path is the candidate chosen to connect media to. Once everything has been agreed, another SIP INVITE message is sent to the called party’s endpoint containing the agreed and negotiated SDP information to use to connect the media.
This is an extract of the SDP information in the second SIP INVITE. As we can see, we have agreed to use the SILK Wideband codec and we are connecting directly to the other client endpoint without a media relay server. This is because both clients, although on different Skype for Business infrastructure (i.e. the calling party is on Skype for Business Server and the called party is on Skype for Business Online), are on the same internal network as each other. If this was a true federated call, the candidate containing the Edge AV IP of the called party would be used.
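The codec matching described above boils down to walking the calling party’s preference list until a codec the called party also supports is found. Here is a minimal Python sketch of that idea; it is my own illustration of the logic, not the client’s actual implementation, and the codec lists are shortened to the ones named in this post.

```python
from typing import Optional


def pick_codec(caller_preferences: list, callee_supported: list) -> Optional[str]:
    """Return the first codec in the caller's order that the callee also lists."""
    supported = {name.lower() for name in callee_supported}
    for name in caller_preferences:
        if name.lower() in supported:
            return name
    return None  # no common codec, so this media type cannot be established


# Shortened lists: the real SDP bodies advertise many more codecs than this.
caller = ["G722/8000/2", "SILK/16000", "telephone-event/8000"]
callee = ["SILK/16000", "telephone-event/8000"]

chosen = pick_codec(caller, callee)
print(chosen)
```

Matching is done on the codec name rather than the payload type number, on the assumption that dynamic payload numbers are not guaranteed to mean the same thing on both sides.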
Notice that we are still sending all of the codecs we can use between the clients. This is sent for fast renegotiation if network impairments prevent SILK Wideband use mid call, allowing the conversation to continue without either party having to hang up and dial again.
The called party’s endpoint will send a 200 OK response back to the calling party to inform it that it accepts this SDP information.
The calling party’s endpoint then sends an ACK back to the called party, media is established and you can now talk!
I hope this helps you understand the basics of media establishment.
Mark is an Independent Microsoft Teams Consultant with over 15 years experience in Microsoft Technology. Mark is the founder of Commsverse, a dedicated Microsoft Teams conference and former MVP. You can follow him on twitter @UnifiedVale