AKS with AGIC existing AGW

TL;DR Some identity gotchas when deploying an AKS Application Gateway Ingress Controller with an existing Application Gateway using Terraform

This is a writeup of a challenge I faced recently: a newly deployed AKS cluster was not able to control the Application Gateway we had deployed for it. Spoiler alert! It was identity and permissions related. All code can be found here on GitHub.

What is AGIC?

The Application Gateway Ingress Controller (AGIC) is a Kubernetes application that simplifies configuration of Application Gateway (AGW) rules from AKS workloads. It lets you focus on your Kubernetes applications rather than the backend pools, backend HTTP settings, and other AGW-specific config. You can read more about it here.

Deploy with new Application Gateway (Greenfield)

Greenfield deployment assumes you have nothing deployed from before, and that you will deploy both AKS and AGW at the same time.

If you do it this way, AKS deploys the AGW for you, so you have less control over its configuration. This method configures all necessary permissions and resources, and is simpler to perform.

As far as I can gather, this is the way of doing greenfield deployment via Terraform for AGIC:

resource "azurerm_kubernetes_cluster" "k8s" {
  name                    = var.aks_name
  location                = azurerm_resource_group.rg.location
  dns_prefix              = var.aks_dns_prefix
  private_cluster_enabled = true

  resource_group_name = azurerm_resource_group.rg.name

  http_application_routing_enabled = false

  [ ... ] # Redacted for readability

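  ingress_application_gateway {
    gateway_name = "agic-appgw"
    subnet_id    = azurerm_subnet.appgwsubnet.id
    # Alternatively, set subnet_cidr = "10.0.0.0/24" instead of subnet_id.
    # The azurerm provider accepts only one of subnet_id, subnet_cidr, or gateway_id.
  }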

  [ ... ] # Redacted for readability
}

This code creates an Application Gateway for you in the node pool resource group, inside the subnet you specify. AKS also handles any required permissions automatically.

Deploy with existing Application Gateway (Brownfield)

Brownfield deployment assumes that AKS and AGW are already deployed, and that you enable the AGIC add-on on top of them. The complete code for all resources can be found here.

Do it like this, and you have complete control over deployment of the AGW, but you need to configure permissions correctly for it to work.

This is the method I have been troubleshooting, and it is the focus of this post.

This is the way of doing brownfield deployment via Terraform for AGIC:

resource "azurerm_kubernetes_cluster" "k8s" {
  name                    = var.aks_name
  location                = azurerm_resource_group.rg.location
  dns_prefix              = var.aks_dns_prefix
  private_cluster_enabled = true

  [ ... ] # Redacted for readability

  ingress_application_gateway {
    gateway_id = azurerm_application_gateway.network.id
  }

  [ ... ] # Redacted for readability

}

Notice that only the Application Gateway ID is provided. The backend fetches any necessary information from that resource ID and links the two resources together. This assumes you have created some supporting infrastructure, and that you have granted the appropriate permissions.

Troubleshooting

This is where it gets interesting.

First I tried giving permissions to the SystemAssigned managed identity for the AKS cluster in main.tf:

resource "azurerm_role_assignment" "aks_id_contributor_agw" {
  scope                = azurerm_application_gateway.network.id
  role_definition_name = "Contributor"
  principal_id         = azurerm_kubernetes_cluster.k8s.identity[0].principal_id
  depends_on           = [azurerm_application_gateway.network]
}

This did not work, and I was presented with a non-working configuration. To further dig into this I needed to use kubectl and navigate the cluster. In this case I am using kubectl from a jumpbox in the same vnet, because it is a private cluster.

I used these commands to get the logs from the ingress-appgw pod:

kubectl get pods --all-namespaces
kubectl logs ingress-appgw-deployment-<random> -n kube-system --tail 100

First I list the pods in all namespaces, and I find my ingress-appgw-deployment pod. Then I can get the last 100 lines of logs from it.

The logs show pretty clearly that the permissions have not been granted. Strange. (Don’t worry about the client IDs, as they will be long gone once this post is up.)

To troubleshoot this further, I wanted to search for this client id in my enterprise applications. I searched for the object id which starts with 9843 from the screenshot above.

This shows me a new identity I did not know about. It seems AKS creates a dedicated identity to manage the Application Gateway! 🤯

After some RTFM’ing I found the needed configuration settings buried in the documentation page:

The “object_id” from this exported block is part of the solution. To be honest, the docs are somewhat unclear. I checked the contents of client_id, object_id, and user_assigned_identity_id with Terraform output, and they never contain the UAMI “used by the Application Gateway”. Instead, they contain a separate identity that AKS creates and somehow associates with the cluster.

You find it by using az aks show -n aksclustername -g resourcegroupname.
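
One way to see this identity without leaving Terraform is to expose the exported block as an output. This is only a minimal sketch: it assumes the cluster resource is named azurerm_kubernetes_cluster.k8s as in the snippets above, and the output name is made up for illustration.

```hcl
# Hypothetical output name. The attribute path follows the exported
# ingress_application_gateway_identity block of azurerm_kubernetes_cluster.
output "agic_identity_object_id" {
  description = "Object ID of the identity AKS created for the AGIC add-on"
  value       = azurerm_kubernetes_cluster.k8s.ingress_application_gateway[0].ingress_application_gateway_identity[0].object_id
}
```

After an apply, terraform output agic_identity_object_id should print the object ID that you can compare against the one in the pod logs.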

Logically, granting Contributor permission on the AGW to the AKS AGW User Assigned Managed Identity (aks_ingressid_contributor_on_agw below) should be enough. This should permit AKS to manage the AGW. Weirdly enough, we still get an error.

You either get:

Code="LinkedAuthorizationFailed" Message="The client 'client_id' with object id 'object_id' has permission to perform action 'Microsoft.Network/applicationGateways/write' on scope '/subscriptions/subscription-id/resourceGroups/resource-group/providers/Microsoft.Network/applicationGateways/agw-name'; however, it does not have permission to perform action 'Microsoft.ManagedIdentity/userAssignedIdentities/assign/action' on the linked scope(s) '/subscriptions/subscription-id/resourcegroups/resource-group/providers/Microsoft.ManagedIdentity/userAssignedIdentities/your-uami' or the linked scope(s) are invalid."

Or you get:

Code="AuthorizationFailed" Message="The client 'object_id' with object id 'object_id' does not have authorization to perform action 'Microsoft.Network/applicationGateways/read' over scope '/subscriptions/subscription-id/resourceGroups/resource-group/providers/Microsoft.Network/applicationGateways/agw-name' or the scope is invalid. If access was recently granted, please refresh your credentials."

I am not quite sure if this is because of some credential caching, or if it’s just unpredictable behaviour 🤔
 
The apparent reason for these errors is missing permissions for the AGW User Assigned Managed Identity (azurerm_user_assigned_identity.identity_uami below). These are granted in “uami_contributor_on_agw” below.

Also playing a part in this issue is that the AKS AGW User Assigned Managed Identity lacks permissions on the AGW User Assigned Managed Identity. These are granted in “aks_ingressid_contributor_on_uami” below.

This code should grant permissions to the correct identities:


(Code can be found here)
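
For orientation, here is a sketch of what the three role assignments named above could look like. The resource names are taken from the text; the Contributor role on every scope is an assumption based on those names, and the locals block is added only for readability. Treat it as an illustration, not a copy of the linked repository.

```hcl
# Sketch only: names come from the post, roles and scopes are assumptions.

locals {
  # Identity AKS creates for the AGIC add-on (the "AKS AGW" identity).
  agic_object_id = azurerm_kubernetes_cluster.k8s.ingress_application_gateway[0].ingress_application_gateway_identity[0].object_id
}

# 1. Let the AGIC add-on identity manage the Application Gateway.
resource "azurerm_role_assignment" "aks_ingressid_contributor_on_agw" {
  scope                = azurerm_application_gateway.network.id
  role_definition_name = "Contributor"
  principal_id         = local.agic_object_id
}

# 2. Let the AGIC add-on identity act on the gateway's own managed identity
#    (covers the userAssignedIdentities/assign/action from the first error).
resource "azurerm_role_assignment" "aks_ingressid_contributor_on_uami" {
  scope                = azurerm_user_assigned_identity.identity_uami.id
  role_definition_name = "Contributor"
  principal_id         = local.agic_object_id
}

# 3. Give the gateway's own managed identity access to the gateway.
resource "azurerm_role_assignment" "uami_contributor_on_agw" {
  scope                = azurerm_application_gateway.network.id
  role_definition_name = "Contributor"
  principal_id         = azurerm_user_assigned_identity.identity_uami.principal_id
}
```

Contributor is broader than strictly needed on the identity scope. The built-in Managed Identity Operator role might be enough for assignment 2, but that is untested here.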

The pod auto-retries every 10 seconds, and once the permissions are corrected, the log output from the ingress pod looks like this:

Bonus voting game

You can test the ingress with a fun voting game from Microsoft deployed with this manifest. A big thanks to Thomas Thornton and Microsoft Docs for the great deployment.yaml, which I have modified somewhat to fit some new syntax.

  • SSH to your jumpbox with ssh adminuser@jumpboxpip
  • Install Azure CLI and kubectl
  • Log in with az login
  • Get kubecontext with az aks get-credentials -n clustername -g resourcegroupname --admin
  • Run nano deployment.yaml from the jumpbox.
  • Copy/paste the contents of deployment.yaml into your terminal.
  • Save and exit nano with Ctrl+O then Ctrl+X
  • Run kubectl apply -f deployment.yaml
    • ProTip: You can also do kubectl delete -f deployment.yaml to remove pods++ again.

Deployment:

Public IP of your application gateway should be in the output from terraform apply earlier:
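
If your configuration does not already print it, an output along these lines would do so. This is a sketch under an assumption: the resource name azurerm_public_ip.agw_pip is hypothetical and stands in for whichever public IP resource your gateway's frontend uses.

```hcl
# Hypothetical resource name: replace agw_pip with the public IP resource
# attached to your Application Gateway frontend.
output "application_gateway_public_ip" {
  description = "Public IP address of the Application Gateway frontend"
  value       = azurerm_public_ip.agw_pip.ip_address
}
```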

Will look like this in your browser:

In summary

Enabling Application Gateway Ingress Controller on a pre-existing Application Gateway can be quite an experience if you haven’t got the permissions sorted out. This gets particularly bad if the Application Gateway is associated with a User-assigned Managed Identity. This can cause hours of troubleshooting, and hopefully I have saved you these hours now 😀

I have deployed and redeployed these resources many times during testing and troubleshooting, so the IDs and resource names have changed. They are deleted and not part of any remaining solution, which is why I have left the client_ids and object_ids in cleartext here.
