Cleaning Up Stale PKS Kubernetes LoadBalancer IP allocations in NSX-T

I was working in a proof of concept environment using NSX-T where we didn't have a lot of IPs in the Floating IP pool for k8s clusters provisioned by PKS. We had two clusters deployed and we were trying to start up a few pods with a LoadBalancer service. The problem we hit was that the pods wouldn't start up, and were stuck in the "Init" status. We weren't seeing enough detail via kubectl, so we found the node that was trying to start the Pod's containers, and checked the kubelet.log for more information. Interestingly, we noticed some messages about NSX-T right before the pods failed. This got us thinking that although these particular pods didn't have any special initialization, NSX-T was doing some work to try to allocate resources for the Pod to expose it via a LoadBalancer service.
On a hunch, I checked the NSX-T Manager, went to the Inventory -> Groups -> IP Pools section, and noticed that the Floating IP Pool had all of its IPs allocated! Come to find out, someone had deployed another cluster without our knowledge, and we had run out of Floating IPs. We deleted this extra cluster, and that got us going again for the time being. However, I noticed that there were still a bunch of IPs allocated from the pool for only two clusters. I could understand 1 IP for the masters (NSX-T gets a Load Balancer configured for the masters in the cluster), and each cluster had 1 additional LoadBalancer service provisioned, but that didn't account for all the extra IPs that had been allocated. We tried doing traceroutes to all those allocated IPs and found that many of them were unresponsive. We also found out that, during testing, someone had deleted some clusters with BOSH without using the PKS CLI. They had cleaned up all the objects they could find in NSX-T, but hadn't taken care of the IPs from the Floating IP Pool.
To clean those up, I simply ran traceroute against each of the allocated addresses in the Floating IP Pool, and then called the NSX-T API to release the ones that weren't responding. This is the API call to release an IP allocated from an IP Pool, which I was able to make against NSX-T 2.3:
curl -k -u <user>:<password> -X POST 'https://<nsx-manager>/api/v1/pools/ip-pools/<ip-pool-id>?action=RELEASE' -H "X-Allow-Overwrite: true" -d '{"allocation_id":"<ip-address>"}' -H "Content-Type: application/json"
Replace the parts in angle brackets above with your info.
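If there are many stale addresses, the same cleanup could be scripted. The following is a minimal Python sketch, not a tested tool: it only builds the request pieces for the release call and works out which allocated IPs did not respond. The helper names, the host name, and the pool ID are hypothetical, and the "allocation_id" body field reflects my understanding of the NSX-T IP pool API, so verify it against the API reference for your NSX-T version before sending anything.

```python
import json


def build_release_request(manager_host, pool_id, ip_address):
    """Return (url, headers, body) for releasing one IP back to an NSX-T IP pool.

    Assumption: the request body uses an "allocation_id" field holding the IP.
    """
    url = f"https://{manager_host}/api/v1/pools/ip-pools/{pool_id}?action=RELEASE"
    headers = {
        "Content-Type": "application/json",
        "X-Allow-Overwrite": "true",
    }
    body = json.dumps({"allocation_id": ip_address})
    return url, headers, body


def stale_ips(allocated, responding):
    """Return the allocated IPs that did not respond to a probe, in order."""
    responding = set(responding)
    return [ip for ip in allocated if ip not in responding]


if __name__ == "__main__":
    # Hypothetical data: in practice these lists come from the pool's
    # allocations and from your own traceroute/ping results.
    allocated = ["10.0.0.10", "10.0.0.11", "10.0.0.12"]
    responding = ["10.0.0.10"]
    for ip in stale_ips(allocated, responding):
        url, headers, body = build_release_request("nsx.example.com", "pool-1234", ip)
        print("POST", url, body)
```

The sketch deliberately stops short of sending the request, so you can review the list of addresses before releasing anything.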

Strategies for Parsing Service Information In Cloud Foundry

Cloud Foundry’s service marketplace provides self-service access to a curated set of services that have their lifecycle automatically managed by the platform.  Part of the lifecycle of any service instance involves connecting that service to an application through a process called “binding”.  There are a variety of mechanisms that an application can use to look up the information that a binding represents, and I’ll try to write up some of the best practices.

When a service instance is bound to an application in Cloud Foundry, the connection information and credentials for that service instance get exposed to the application as a block of JSON in an environment variable called VCAP_SERVICES.

The format of that JSON is similar to the following:
{
  "service-short-name": [
    {
      "binding_name": null,
      "credentials": {
        "service-specific-key1": "service-specific-value1",
        "service-specific-key2": "service-specific-value2"
      },
      "instance_name": "MyService",
      "label": "service-short-name",
      "name": "MyService",
      "plan": "service-plan-short-name",
      "provider": null,
      "syslog_drain_url": null,
      "tags": [
        "tag1",
        "tag2"
      ],
      "volume_mounts": []
    }
  ]
}

Use a Library

When looking up service information, you can simplify the code you have to write by using a library like Spring Cloud Connectors (https://cloud.spring.io/spring-cloud-connectors/) for both Spring and non-Spring Java applications or Steeltoe Connectors (http://steeltoe.io/docs/steeltoe-connectors/) for .NET and .NET Core applications.  These libraries have strategies for looking up services that are resilient enough to work with many different service providers and types, and can even automate connections to User Provided Services (https://docs.cloudfoundry.org/devguide/services/user-provided.html) provided those service instances contain credentials that look similar to a brokered service of the same type.  These libraries also take care of building connection objects for these services and (if applicable) inject those objects into a Dependency Injection context to make it easy for your application to use those services.

Please, Use a Library

If you don't like the way that Spring Cloud Connectors or Steeltoe Connectors does all the work for you, you could simply use a library that parses VCAP_SERVICES for you.  Steeltoe Configuration (http://steeltoe.io/docs/steeltoe-configuration/#1-2-3-access-configuration-data) and Spring Boot (https://docs.spring.io/spring-boot/docs/current/api/org/springframework/boot/cloud/CloudFoundryVcapEnvironmentPostProcessor.html) will automatically parse VCAP_SERVICES into framework-specific configuration information that you can reference easily via the name of the service that is bound to the application.

Manual Parsing

If you can't use these libraries for some reason, you need to be careful in the way you parse the JSON data.  At a minimum, you want to avoid hard-coding keys used to traverse the JSON service information that could change without your knowledge.  You also want your logic to be flexible enough to work with other service brokers that might provide the same core service, but implement that service in different ways.

To start with, all service bindings that get exposed to your application in Cloud Foundry are broken out by the short name of the service type used to provision them.  For example, the broker for Redis Labs Enterprise Cluster uses the short name "redislabs" for its service offering, Pivotal uses "p-redis" for its service offering, and all user provided services use the short name "user-provided-service".  Each of these service types could provide the information about a Redis endpoint for your application to talk to, so you want your application to be flexible enough to use any of these options to make it easier for the people who have to deploy and operate your application.  However, if you write your parsing code to only look for the key "p-redis" in the top level object, you wouldn't find your service information when someone used a user provided service or RLEC service for your application.

One strategy to deal with this is to simply loop over all the top level keys, then loop over the array that each top level key is associated with, and then look within each object in the array for information that you have some control over.  One simple approach would be to look for a service bound to your application that has a particular name.

Pseudocode for that might look something like this:
var vcap_services = Parse the JSON value of the VCAP_SERVICES environment variable;
foreach ( service_type in vcap_services.keys ) {
  foreach ( service_instance in vcap_services[service_type] ) {
    if( service_instance.name == "MyService" ) {
      Extract service info and build connector to the service;
    }
  }
}
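
As a concrete illustration of the loop above, here is a minimal Python sketch. The function names are hypothetical, and it assumes VCAP_SERVICES has the shape shown earlier (top level keys mapping to arrays of service instance objects).

```python
import json
import os


def load_vcap_services(environ=os.environ):
    """Parse VCAP_SERVICES from the environment; empty dict if it is not set."""
    return json.loads(environ.get("VCAP_SERVICES", "{}"))


def find_service_by_name(vcap_services, name):
    """Return the first bound service instance with the given name, or None.

    Every top level key (service type) is searched, so the lookup does not
    depend on which broker provisioned the instance.
    """
    for instances in vcap_services.values():
        for instance in instances:
            if instance.get("name") == name:
                return instance
    return None


if __name__ == "__main__":
    service = find_service_by_name(load_vcap_services(), "MyService")
    if service is None:
        print("MyService is not bound to this application")
    else:
        # Only the key names are printed, to avoid leaking credentials.
        print("Found credentials with keys:", sorted(service["credentials"]))
```

Returning None instead of raising lets the caller decide whether a missing binding is fatal.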

The Cloud Foundry Cloud Controller API v2.99.0+ (PAS 2.1+) allows the name under which a service is exposed to an application to be set at binding time.  In this way, an application could have a specific name it looks for like "MyService", but the actual service name could be something completely different.  When binding a service via the CF CLI, you could execute the following command to specify a "binding name":
cf bind-service MyAppName SomeServiceNamedStrangely --binding-name MyService
Running a command like the one above when binding would allow your application to still look up the service by the name MyService, no matter what the service instance was actually named.

A More Flexible Method

Naming is the hardest problem in Computer Science, so it is no surprise that there are conflicting interests that might come into play around how your services are named.  If you want to make your application able to handle the most flexible service possibilities, you could actually replicate the same pattern that the above libraries use to find bound services.  Both Steeltoe Connectors and Spring Cloud Connectors apply a strategy which involves looking at the service's tags or for the scheme of a URI in the service's credentials section.

The tags are a multi-valued array that is commonly supplied by the service broker for a particular service.  You can supply additional tags to add to the service on creation, or you can update a service with additional tags after you have created it.  You need to be using Cloud Foundry Cloud Controller API v2.104.0+ (PAS 2.2+) to have support for setting tags on user provided services.

For many services, you can express the connection information via a URI.  Steeltoe Connectors and Spring Cloud Connectors both look for the keys "uri" or "url" _or_ for those keys prefixed with the "scheme" that matches the service type (like "redisUri", or "redisUrl").  If your service type supports a URI style for connections, this might be a great way to look up your services no matter how they were provisioned.

Either of these lookup strategies could be much more robust than the previous methods as they rely on data that is in your direct control.  You could combine some of the above techniques as well.  For instance, you could look for URIs with specific schemes, but then disambiguate them with their service name if you had multiple services of the same type bound to your app.
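
To make the tag and URI scheme strategies concrete, here is a minimal Python sketch. It is a simplified imitation of the idea, not the actual logic of Steeltoe Connectors or Spring Cloud Connectors, and the function names are hypothetical. It treats a service as a match if it carries the expected tag, or if one of its "uri"/"url" style credential keys holds a URI with the expected scheme.

```python
from urllib.parse import urlparse


def uri_candidates(credentials, scheme):
    """Collect string values stored under uri/url keys, plain or scheme-prefixed."""
    keys = ("uri", "url", f"{scheme}Uri", f"{scheme}Url")
    return [credentials[k] for k in keys if isinstance(credentials.get(k), str)]


def find_services(vcap_services, tag, scheme):
    """Return every bound instance that has the tag or a URI with the scheme."""
    matches = []
    for instances in vcap_services.values():
        for instance in instances:
            tags = instance.get("tags") or []
            credentials = instance.get("credentials") or {}
            has_tag = tag in tags
            has_uri = any(
                urlparse(uri).scheme == scheme
                for uri in uri_candidates(credentials, scheme)
            )
            if has_tag or has_uri:
                matches.append(instance)
    return matches


if __name__ == "__main__":
    # Hypothetical bindings: one brokered service found by tag, one user
    # provided service found by URI scheme, and one unrelated service.
    vcap = {
        "p-redis": [{"name": "cache", "tags": ["redis"],
                     "credentials": {"host": "10.0.0.5", "port": 6379}}],
        "user-provided-service": [{"name": "ext-cache", "tags": [],
                                   "credentials": {"uri": "redis://:pw@example.com:6379"}}],
        "p-mysql": [{"name": "db", "tags": ["mysql"],
                     "credentials": {"uri": "mysql://user:pw@example.com/db"}}],
    }
    print([s["name"] for s in find_services(vcap, "redis", "redis")])
```

If more than one instance matches, you could fall back to the service name to disambiguate, as described above.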

Local Troubleshooting Technique for CloudFoundry HWC Apps Running in Windows 2012R2

Cloud Foundry gives us a simple way to get Windows applications normally hosted in IIS to production quickly with the HWC Buildpack.  However, when the application fails to stage or run, it can be difficult to figure out what is going on.  You can actually run your application locally in a similar way to the way the HWC Buildpack would cause your application to run in Cloud Foundry by running the HWC executable locally against your app.

The HWC Buildpack relies on the HWC project to actually run your application in Cloud Foundry.  The HWC process uses Microsoft's Hostable Web Core to run a mini version of IIS in a process that your application is hosted in.  The HWC project creates releases of the executable that you can download and run locally on your workstation.

Before running HWC, you'll need to make sure your workstation has some pre-requisites installed.  If you go to the Running the Tests (Windows Only) section of the README.md in the HWC Project, you will see a heading with the text "Install Windows/.NET Features from Powershell by running".  Execute the Powershell commands in that section to install the necessary Windows Features to be able to run hwc.exe.

At this point you can set any environment variables you need (particularly the PORT environment variable), and launch the hwc.exe process pointing at your IIS-hostable application.  The PORT environment variable is the only required environment variable, and it controls the TCP port that hwc.exe will attempt to bind to when running your application.  You specify your application directory by passing an -appRootPath parameter to HWC that points to a directory with a Web.config file and the rest of your application bits.

Here's an example taken from the HWC project's README.md of running the HWC process using Powershell:
& { $env:PORT=8080; .\hwc.exe -appRootPath "C:\wwwroot\inetpub\myapproot" }

You should see a response similar to the following:
Server started for c21246d3-d9de-4014-810f-ad3130214a79

At this point you could open a web browser to localhost:8080 (or whatever port you set in your PORT environment variable) and browse your site.  If you get errors, you can check the Windows Event log or the terminal window to see what is going on.

PowerCLI Script to Recover Pivotal Operations Manager VApp Settings


If you've read my previous blog posts, you know that I'm running a home vSphere lab on a shoestring budget with hardware that is "mostly" working "most" of the time.


One of my NUC hosts locked up recently and I noticed that my Pivotal Operations Manager VM for Cloud Foundry just wouldn't use the static IP address I had assigned to it at install. The networking settings are stored as VApp Options that you can set when you deploy the Operations Manager OVA. I figured that maybe the failure caused those settings to get out of sync, so I tried to update them again and save the updates, but I kept getting an error from the vCenter Web Client saying that it "Cannot complete operation due to concurrent modification by another operation."


I thought something must be out of whack, so I removed the VM from inventory, and re-added it. Of course when you do that you lose all the VApp options you had set, but you also lose the definition that they existed in the first place! So, I wrote a little PowerCLI script to recover those settings in case I lose them again so that I don't have to re-deploy the Operations Manager VM. Just copy and paste this text into your PowerCLI session, and call "Set-OpsManConfig" to run it:


function BuildProp([int]$key, [string]$id, [string]$label, [string]$type, [string]$description)
{
    $prop = New-Object VMware.Vim.VAppPropertySpec
    $prop.operation = "add"
    $prop.info = New-Object VMware.Vim.VAppPropertyInfo
    $prop.info.key = $key
    $prop.info.id = $id
    $prop.info.label = $label
    $prop.info.type = $type
    $prop.info.userConfigurable = $true
    $prop.info.description = $description
    return $prop
}

function Set-OpsManConfig{
 Param(
  [Parameter(Mandatory=$true, HelpMessage="The Name of the VM to update")]
  [string]$VMName, 
  [Parameter(Mandatory=$true, HelpMessage="The IP for the OpsMan VM to use")]
  [string]$IP, 
  [Parameter(Mandatory=$true, HelpMessage="The Netmask to set for the OpsMan VM")]
  [string]$Netmask, 
  [Parameter(Mandatory=$true, HelpMessage="The default gateway to set for the OpsMan VM")]
  [string]$Gateway, 
  [Parameter(Mandatory=$true, HelpMessage="A comma separated list of DNS IP addresses to set for the OpsMan VM")]
  [string]$DNS, 
  [Parameter(Mandatory=$true, HelpMessage="A comma separated list of NTP Server IPs to set for the OpsMan VM")]
  [string]$NTP, 
  [Parameter(HelpMessage="The Hostname to set for the OpsMan VM")]
  [string]$Hostname = "",
  [Parameter(Mandatory=$true, HelpMessage="The password for the 'ubuntu' user for the OpsMan VM")]
  [string]$AdminPassword
 )

 $spec = New-Object VMware.Vim.VirtualMachineConfigSpec
 $spec.VAppConfig = New-Object VMware.Vim.VmConfigSpec
 $spec.VAppConfig.OvfEnvironmentTransport = [string[]]"com.vmware.guestInfo"
 $spec.VAppConfig.Property = New-Object VMware.Vim.VAppPropertySpec[] (7)
 $spec.VAppConfig.Property[0] = BuildProp 0 "admin_password" "Admin Password" "password" "This password is used to SSH into the Ops Manager. The username is 'ubuntu'."
 $spec.VAppConfig.Property[1] = BuildProp 1 "ip0" "IP Address" "string" "The IP address for the Ops Manager. Leave blank if DHCP is desired."
 $spec.VAppConfig.Property[2] = BuildProp 2 "DNS" "DNS" "string" "The domain name servers for the Ops Manager (comma separated). Leave blank if DHCP is desired."
 $spec.VAppConfig.Property[3] = BuildProp 3 "netmask0" "Netmask" "string" "The netmask for the Ops Manager's network. Leave blank if DHCP is desired."
 $spec.VAppConfig.Property[4] = BuildProp 4 "ntp_servers" "NTP Servers" "string" "Comma-delimited list of NTP servers"
 $spec.VAppConfig.Property[5] = BuildProp 5 "gateway" "Default Gateway" "string" "The default gateway address for the Ops Manager's network. Leave blank if DHCP is desired."
 $spec.VAppConfig.Property[6] = BuildProp 6 "custom_hostname" "Custom Hostname" "string" "This will be set as the hostname on the VM. Default: 'pivotal-ops-manager'."
 
 #Set Values to whatever is appropriate for your environment
 $spec.VAppConfig.Property[0].Info.Value = $AdminPassword
 $spec.VAppConfig.Property[1].Info.Value = $ip
 $spec.VAppConfig.Property[2].Info.Value = $dns
 $spec.VAppConfig.Property[3].Info.Value = $netmask
 $spec.VAppConfig.Property[4].Info.Value = $ntp
 $spec.VAppConfig.Property[5].Info.Value = $gateway
 $spec.VAppConfig.Property[6].Info.Value = $hostname
 
 $vm = Get-VM -Name $vmname
 $vm.ExtensionData.ReconfigVM_Task($spec)

<#
.SYNOPSIS
Sets the VAppConfig settings for the OpsManager VM

.DESCRIPTION
Sets the VAppConfig settings for the OpsManager VM.  This command could be used on a non-OpsManager VM, but the results are undefined.
#>
}


Running RabbitMQ's PerfTest tool in CloudFoundry

I recently had to troubleshoot performance of an app running in Cloud Foundry (Pivotal CF specifically) trying to use RabbitMQ.  The RabbitMQ team provides a great benchmarking tool that we can use to validate performance of a RabbitMQ cluster, and we can use that tool inside a container running in Cloud Foundry.
The following instructions assume you are using the CF CLI version 6.23.0+ (check with cf -v), and running against a Cloud Foundry that supports CC API v2.65.0+ (check with the cf target command after logging in to validate.)
  1. First, download the latest RabbitMQ PerfTest zip archive.  I used the GitHub releases page for the project, and just grabbed the latest release.
  2. Next, paste the following contents into a file called "manifest-rabbitperf.yml" in the same directory as the ZIP file you downloaded (making sure to update the "path" part to reflect the actual name of the ZIP file you downloaded):
    ---
    applications:
    - name: rabbitperf
      instances: 0
      no-route: true
      path: rabbitmq-perf-test-1.4.0-bin.zip
    
  3. Now, open a terminal, and navigate to the directory you downloaded the ZIP file to, and push the tool to Cloud Foundry: cf push -f manifest-rabbitperf.yml
  4. If you want to test against a brokered instance of RabbitMQ, and you have that service installed in your instance of Cloud Foundry, you can create an instance of that service and a service key for it to test against. In my case, I had an install of Pivotal CF with the RabbitMQ tile installed, so I created a service instance with cf create-service p-rabbitmq standard myrabbit, and then created a service key for it with cf create-service-key myrabbit perfkey. Then, from the output of cf service-key myrabbit perfkey I was able to grab the first element in the "uris" array to run my load test against.
  5. Next, in the terminal, run an instance of the performance test with the following command (replacing <amqp-uri> with the URI from the service key you created above, or your preferred URI): cf run-task rabbitperf "JAVA_HOME=.java-buildpack/open_jdk_jre rabbitmq-perf-test-*/bin/runjava com.rabbitmq.perf.PerfTest -x 1 -y 1 -a -h <amqp-uri>" --name perftest1
  6. After launching the test, I could then monitor the RabbitMQ console for performance stats.  If you want to track the output of the PerfTest tool, you can execute cf logs rabbitperf in another window to track the output of that task run.
One note: the task command string above will cause the load test to run forever. You can stop the test by running the cf tasks rabbitperf command and looking in the output for the ID of the running task you want to terminate. Then you can call cf terminate-task rabbitperf <task-id> (replacing <task-id> with the ID of the task to kill) to stop the task.
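If you would rather not copy the URI by hand in step 4, the service key output could be parsed with a few lines of Python. This is a minimal sketch under two assumptions: that cf service-key prints a status line followed by the credentials as a JSON object (which is how CF CLI v6 behaved for me), and that the credentials contain a "uris" array as described above. The function name and the sample output are hypothetical.

```python
import json


def first_uri_from_service_key(cli_output):
    """Extract the first entry of the "uris" array from `cf service-key` output.

    Everything before the first "{" is treated as CLI status text and skipped.
    """
    start = cli_output.find("{")
    if start == -1:
        raise ValueError("no JSON object found in service key output")
    credentials = json.loads(cli_output[start:])
    uris = credentials.get("uris") or []
    if not uris:
        raise ValueError("service key has no 'uris' entries")
    return uris[0]


if __name__ == "__main__":
    # Hypothetical output; real values come from: cf service-key myrabbit perfkey
    sample_output = """Getting key perfkey for service instance myrabbit as admin...

{
 "uris": [
  "amqp://user:secret@10.0.0.20/vhost",
  "amqp://user:secret@10.0.0.21/vhost"
 ]
}
"""
    print(first_uri_from_service_key(sample_output))
```

You could feed it the captured output of the cf command and paste the result into the run-task command in step 5.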

Pushing Your First .NET Core 1.0 RTM Apps to Cloud Foundry and PCF Dev

Pushing a .NET Core 1.0 RTM application to Cloud Foundry is a fairly straightforward process.

Follow the instructions at https://www.microsoft.com/net/core to install all the necessary binaries on your development machine, and create the "Hello World" application as they show you on that page.

Next, go to https://docs.asp.net/en/latest/getting-started.html and either create a new project as that page shows you or modify your "Hello World" app from earlier using the instructions on that page.  Run your app locally per the last couple instructions on that page to make sure it works.

Next, we need to make some slight tweaks to the application to work better with Cloud Foundry.  Follow the instructions at https://github.com/cloudfoundry-community/dotnet-core-buildpack#using-samples-from-the-cli-samples-repository to add a dependency to your project.json to allow configuration of the Kestrel server via command line arguments.  You will also modify your Main method in the project to wire in the command line arguments to a Configuration object, and also make sure that the WebHostBuilder uses that new configuration.

Finally, push your application to Cloud Foundry using the following command (replacing SOME_APP_NAME with your own app name):

cf push SOME_APP_NAME -b https://github.com/cloudfoundry-community/dotnet-core-buildpack

Note: If you are using PCF Dev to test this in a local install of Cloud Foundry (at least with version 0.16.0) you will need to raise the size of the disk quota for the container to 1G.  You can do that by using the following to push your app:

cf push SOME_APP_NAME -b https://github.com/cloudfoundry-community/dotnet-core-buildpack -k 1G

Recovering from vCenter Appliance Disk Errors on LVM Devices

Let's say you have a ghetto vSphere home lab.

And let's say that you are running a vCenter Appliance to manage that home lab because you didn't want to devote a whole physical machine to that task because you are cheap.

And let's say you are running a small storage server for that home lab that is hosting the disks for that vCenter Appliance.

And let's say that home storage server is running on a UPS, but _sometimes_ the power goes out for a little bit longer than your UPS can handle and you haven't had the time to configure that file server to shut down the vSphere hosts before it shuts itself down.

Everything comes back up after the power failure, but your vCenter Appliance VM is complaining about file system errors and won't boot.  How do you fix that?

Well, the good news is that there are some great guides out there to get you part of the way to a solution.  I followed http://www.opvizor.com/blog/vmware-vcenter-server-appliance-vcsa-filesystem-is-damaged/ to get out to a BASH prompt, but the filesystems that I was getting errors for were on LVM volume groups.  And when I went to look for those devices, they weren't showing up under /dev/mapper.

The problem was that those LVM volume groups were not being marked active when I booted up using the method in the procedure above.  Luckily, the commands below allow you to make sure the device nodes get created under /dev/mapper, and then you can run fsck against the failing LVM devices.

(none):/ # modprobe dm_mod
(none):/ # vgscan
  Failed to find sysfs mount point
  Reading all physical volumes.  This may take a while...
  Found volume group "invsvc_vg" using metadata type lvm2
  Found volume group "autodeploy_vg" using metadata type lvm2
  Found volume group "netdump_vg" using metadata type lvm2
  Found volume group "seat_vg" using metadata type lvm2
  Found volume group "dblog_vg" using metadata type lvm2
  Found volume group "db_vg" using metadata type lvm2
  Found volume group "log_vg" using metadata type lvm2
  Found volume group "core_vg" using metadata type lvm2
  Found volume group "invsvc_vg" using metadata type lvm2
(none):/ # vgchange -ay
  Failed to find sysfs mount point
  1 logical volume(s) found in volume group "invsvc_vg" now active
  1 logical volume(s) found in volume group "autodeploy_vg" now active
  1 logical volume(s) found in volume group "netdump_vg" now active
  1 logical volume(s) found in volume group "seat_vg" now active
  1 logical volume(s) found in volume group "dblog_vg" now active
  1 logical volume(s) found in volume group "db_vg" now active
  1 logical volume(s) found in volume group "log_vg" now active
  1 logical volume(s) found in volume group "core_vg" now active
  1 logical volume(s) found in volume group "invsvc_vg" now active
(none):/ # fsck /dev/mapper/log_vg-log
fsck from util-linux 2.19.1
e2fsck 1.41.9 (22-Aug-2009)
...