top of page

Generating Sample Data with Microsoft Copilot

ree

Use of real data for testing, training, and development scenarios is problematic for security, integrity, and ethical considerations.  This is especially true for copying production data. 


On the other hand, good test data is important for validating new and unexpected scenarios, and training is far more effective if the training data aligns with what participants experience in the real world.


Copilot and similar tools are known as generative artificial intelligence because of their ability to use complex mathematical techniques and statistics to simulate the expected outputs given a set of inputs, for example prompts.  This makes it ideal for crafting and generating sample data to align to development, testing, and training scenarios.


This is even more pertinent when considering the limitations of human authoring of sample data.  People are not good random data generators.  Large volumes of data are time consuming to generate and the automation of the process tends to introduce systematic patterns that aren’t reflective of real situations.  In addition, people can unintentionally reflect live data in their test sets.  At worst, production data is copied with a light mask on top.


This blog is based on exploring the techniques described for generating sample data in the Microsoft article Use Copilot to create sample data for Field Service.


Step 1 - Define your template(s)


ree

As with any data project, the first and most important step is to determine the tables and columns or attributes.


For generative data, this includes any specific format requirements but also ‘realism’ characteristics of the data.


Example considerations for formatting:

  • Email address requirements, for example first.last@domain.com vs lastfirstinitial@domain.com

  • Phone number formatting (with or without country code)

  • Special characters exclusion or inclusion, including if there are requirements to test the capability of the test system to handle special or unexpected characters

  • Number formats, including variations such as commas, symbols (eg currency), negatives, and/or random quantity of decimals


Example realism characteristics:

  • Should place names map to real place names, or be fake?

  • What geographical areas should be considered?

  • Should GPS coordinates correspond to real street addresses, but the street address columns in the sample data will be fake?

  • Is there some level of grouping to apply, for example a maximum of 5 unique job titles with between 3 and 10 employees per job category?


Note that one of the benefits of prompt chaining is that the upfront requirements definition does not need to be exhaustive or exact.  As long as the prompt is in the right general area, then the actually sample data set can be iterated over multiple prompts.


Step 2 - Create a prompt


ree

Describe the data template in words, including the characteristics of the data.


Include a request for how many rows of data.  Note there will be a limit on the number of rows that can be generated at a time, so start with a smaller number of rows to refine the prompt and then use iteration to generate more rows if needed.


Some considerations:

  • Most LLMs allow upload of examples.  You can use this feature to upload the column headers such as via a csv template. 

  • No non-public or production data should be included anywhere, especially as example data for the prompt.  The objective of the exercise is to avoid any live data getting into the samples.  The challenge is to describe what the data looks like without giving anything away.

  • As well as refining the sample data requirements, iterative prompting is a good way to step Copilot through the sample data generation rather than attempting a single massive prompt.


Step 3 – Review the results and iterate


ree

Review the output of the initial prompt and how well it matches to the desired characteristics.

  • Did you ask for town names but the samples vary from 3 person towns to 1 million person cities?  Maybe specify that the only town names that can be used are those with a population between 10,000 and 100,000 people.

  • LLMs can often use public information sources as examples, for example there may be a careers site with a blog of common job titles or even a jobs site

  • Is there too much variety in values for a column that should be more uniform, such as a company department name?  Maybe specify a maximum number of unique values.


If the initial dataset is looking good, ask Copilot to generate a new quantity of rows following the same characteristics.


Step 4 – Record the requirements, prompts, samples and lessons learned


ree

Sometimes sample data can be reused multiple times, for example to run the same tests, but often there are requirements to generate similar but different datasets based on feedback or to cover different scenarios.


Following the above process for first time sample data generation can take hours. However, by recording the requirements, prompts, samples, and the quality of the generated data, the same pattern can then be used for subsequent projects to rapidly generate new datasets.  Prompt chaining allows for tweaks to get the desired outcome, rather than needing to start the whole process from the beginning.


Example scenario


A Microsoft 365 tenant to validate some scenarios in Microsoft Teams.  This is a trial, so we want to use fake data only.


Example Step 1 – Define the templates


Microsoft helpfully provides the ability to upload user accounts, including a template csv.


ree

There are some additional goals:

  • The trial tenant allows for up to 25 users.  I am already an admin, so I can add 24 other accounts

  • User ids need to be an email style format

  • Workers should be located around Sydney, NSW, because we want to test collaboration scenarios and it’s helpful to ground the dataset

  • I’ll include limits on the number of departments and job titles, because a lesson from previous prompts is that there is too much variety in naming if there’s no constraints


Example Step 2 – Create a prompt


Prompt sample:


Create a CSV format list of 24 users with the following fields: concatenation of First name and Last name and @ and CSMTrial.onmicrosoft.com = username, fake first name = First name, fake surname = Last name, concatenation of First name and a space and Last name = Display name, fake job title for a facility worker = Job title, fake department related to the job title = Department, blank field = Office number, blank field = Office phone, blank field = Mobile phone, blank field = Fax, concatenate username and 1 and @ and example.com = Alternate email address, fake street address = Address, real Sydney suburb = City, NSW = State or province, postcode of the suburb = ZIP or postal code, Australia = Country or region. There should be a maximum of 5 unique suburb names. There should be a maximum of 4 unique departments. There should be a maximum of 8 unique job titles.


Let’s break it down


What: Create a CSV format


How much: 24 users


Fields: I’ve used a pattern of a comma separated list of data description = name of the column. Some examples you’ll see in this prompt:

  • Real vs Fake

  • Blank columns

  • Concatenation of values

  • Re-use of values, for example First name is in the username, first name, and full name columns

  • Geographical grounding, for example Sydney suburbs

  • Constant values, for example State and country

  • Dependencies, for example postcode should be the postcode of the suburb

Constraints

  • Maximum unique departments, job titles etc


Example:

ree

In Progress:

ree

Output:

ree

Copy and paste to Excel (csv):

ree

Example Step 3 - Review the results and iterate


Iteration 1 – the data set is too uniform, with 6 users per suburb. 


ree

Let’s mix it up:


ree

Our new distribution:


ree

Iteration 2 – we want to create a Departments table, with the department name and a description its purpose


ree

The full output:


departmentName,departmentDescription


Facilities,"Responsible for the upkeep, repair, and functionality of all physical infrastructure. Includes HVAC systems, lighting, plumbing, and grounds maintenance. Staff perform inspections, manage work orders, and ensure compliance with safety codes. Supports 24/7 operations with rotating shifts and emergency response protocols. Tools used: CMMS software, PPE, and diagnostic equipment."


Operations,"Oversees daily cleaning, janitorial services, and logistical support across all facilities. Tasks include waste management, floor care, restroom sanitation, and supply inventory. Staff follow strict hygiene protocols and use eco-friendly products. KPIs include % of area cleaned per shift, # of incidents reported, and time-to-resolution metrics. Operates under ISO 9001 standards."


Security,"Provides 24/7 surveillance, access control, and incident response. Monitors CCTV systems, patrols premises, and enforces visitor policies. Staff are trained in conflict de-escalation, emergency evacuation, and first aid. Uses RFID badges, alarm systems, and secure logging tools. Reports include # of breaches, response times, and asset protection metrics."


Example Step 4 – Record the requirements, prompts, samples and lessons learned


Thanks for reading this blog post!

Comments


Contact

CSM BUSINESS AND MOBILITY

Suite 87, Level 1

93 George Street

Parramatta, NSW, 2150

Australia

Phone:

02 7239 8862

Email:

info@csmbusinessandmobility.com

LinkedIn: 

    • LinkedIn

    Subscribe to Get Our Newsletter

    Thanks for submitting!

    © 2025 by CSM Business and Mobility.  ABN 56 816 884 882.  Privacy Policy.

    bottom of page