Data Generation Through Workflow in Synthora#

Workflow is a powerful system in Synthora that can be used to orchestrate agents on solving various problems, allowing users to define their own workflow with a high level of flexibility.

In this tutorial, we will show you how to use workflow to generate data. We will start from the simplest SFT data generation and gradually increase the complexity to COT and ToT data generation. Thanks to the flexibility of workflow, the overall process is simple and can be easily customized.

Now, if you are ready, let’s start!

Prerequisites#

Before jumping into the fun stuff, there are a few things you’ll need to set up. (Hang tight—it’s worth it!)

Install Synthora#

Synthora runs on Python 3.8 or later. You can install it quickly using pip:

[1]:

%pip install synthora

Requirement already satisfied: synthora in /home/lxk/syntropix/Syntropic/.venv/lib/python3.13/site-packages (0.1.1)
Requirement already satisfied: asyncio<4.0.0,>=3.4.3 in /home/lxk/syntropix/Syntropic/.venv/lib/python3.13/site-packages (from synthora) (3.4.3)
Requirement already satisfied: docstring-parser<0.17,>=0.16 in /home/lxk/syntropix/Syntropic/.venv/lib/python3.13/site-packages (from synthora) (0.16)
Requirement already satisfied: fastapi<0.116.0,>=0.115.5 in /home/lxk/syntropix/Syntropic/.venv/lib/python3.13/site-packages (from synthora) (0.115.6)
Requirement already satisfied: openai<2.0.0,>=1.55.0 in /home/lxk/syntropix/Syntropic/.venv/lib/python3.13/site-packages (from synthora) (1.58.1)
Requirement already satisfied: pydantic<3.0.0,>=2.10.1 in /home/lxk/syntropix/Syntropic/.venv/lib/python3.13/site-packages (from synthora) (2.10.4)
Requirement already satisfied: rich<14.0.0,>=13.9.4 in /home/lxk/syntropix/Syntropic/.venv/lib/python3.13/site-packages (from synthora) (13.9.4)
Requirement already satisfied: websockets<15.0,>=14.1 in /home/lxk/syntropix/Syntropic/.venv/lib/python3.13/site-packages (from synthora) (14.1)
Requirement already satisfied: starlette<0.42.0,>=0.40.0 in /home/lxk/syntropix/Syntropic/.venv/lib/python3.13/site-packages (from fastapi<0.116.0,>=0.115.5->synthora) (0.41.3)
Requirement already satisfied: typing-extensions>=4.8.0 in /home/lxk/syntropix/Syntropic/.venv/lib/python3.13/site-packages (from fastapi<0.116.0,>=0.115.5->synthora) (4.12.2)
Requirement already satisfied: anyio<5,>=3.5.0 in /home/lxk/syntropix/Syntropic/.venv/lib/python3.13/site-packages (from openai<2.0.0,>=1.55.0->synthora) (4.7.0)
Requirement already satisfied: distro<2,>=1.7.0 in /home/lxk/syntropix/Syntropic/.venv/lib/python3.13/site-packages (from openai<2.0.0,>=1.55.0->synthora) (1.9.0)
Requirement already satisfied: httpx<1,>=0.23.0 in /home/lxk/syntropix/Syntropic/.venv/lib/python3.13/site-packages (from openai<2.0.0,>=1.55.0->synthora) (0.28.1)
Requirement already satisfied: jiter<1,>=0.4.0 in /home/lxk/syntropix/Syntropic/.venv/lib/python3.13/site-packages (from openai<2.0.0,>=1.55.0->synthora) (0.8.2)
Requirement already satisfied: sniffio in /home/lxk/syntropix/Syntropic/.venv/lib/python3.13/site-packages (from openai<2.0.0,>=1.55.0->synthora) (1.3.1)
Requirement already satisfied: tqdm>4 in /home/lxk/syntropix/Syntropic/.venv/lib/python3.13/site-packages (from openai<2.0.0,>=1.55.0->synthora) (4.67.1)
Requirement already satisfied: annotated-types>=0.6.0 in /home/lxk/syntropix/Syntropic/.venv/lib/python3.13/site-packages (from pydantic<3.0.0,>=2.10.1->synthora) (0.7.0)
Requirement already satisfied: pydantic-core==2.27.2 in /home/lxk/syntropix/Syntropic/.venv/lib/python3.13/site-packages (from pydantic<3.0.0,>=2.10.1->synthora) (2.27.2)
Requirement already satisfied: markdown-it-py>=2.2.0 in /home/lxk/syntropix/Syntropic/.venv/lib/python3.13/site-packages (from rich<14.0.0,>=13.9.4->synthora) (3.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /home/lxk/syntropix/Syntropic/.venv/lib/python3.13/site-packages (from rich<14.0.0,>=13.9.4->synthora) (2.18.0)
Requirement already satisfied: idna>=2.8 in /home/lxk/syntropix/Syntropic/.venv/lib/python3.13/site-packages (from anyio<5,>=3.5.0->openai<2.0.0,>=1.55.0->synthora) (3.10)
Requirement already satisfied: certifi in /home/lxk/syntropix/Syntropic/.venv/lib/python3.13/site-packages (from httpx<1,>=0.23.0->openai<2.0.0,>=1.55.0->synthora) (2024.12.14)
Requirement already satisfied: httpcore==1.* in /home/lxk/syntropix/Syntropic/.venv/lib/python3.13/site-packages (from httpx<1,>=0.23.0->openai<2.0.0,>=1.55.0->synthora) (1.0.7)
Requirement already satisfied: h11<0.15,>=0.13 in /home/lxk/syntropix/Syntropic/.venv/lib/python3.13/site-packages (from httpcore==1.*->httpx<1,>=0.23.0->openai<2.0.0,>=1.55.0->synthora) (0.14.0)
Requirement already satisfied: mdurl~=0.1 in /home/lxk/syntropix/Syntropic/.venv/lib/python3.13/site-packages (from markdown-it-py>=2.2.0->rich<14.0.0,>=13.9.4->synthora) (0.1.2)
Note: you may need to restart the kernel to use updated packages.

Import Packages & Set Your API Key#

In this tutorial, we’ll be using OpenAI’s API for data generation. Before we proceed, let’s import the necessary packages and configure the API key:

[1]:

import os
import textwrap
from getpass import getpass
from typing import Any, Dict, List

from synthora.agents import VanillaAgent
from synthora.agents.tot_agent import ToTAgent
from synthora.messages import user
from synthora.messages.base import BaseMessage
from synthora.prompts.buildin import ZeroShotCoTPrompt
from synthora.utils.pydantic_model import get_pydantic_model
from synthora.workflows import task
from synthora.workflows.base_task import BaseTask
from synthora.workflows.scheduler.process_pool import ProcessPoolScheduler
from synthora.workflows.scheduler.thread_pool import ThreadPoolScheduler

[2]:

os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key here: ")

Prompt Preparation#

To make comparisons easier, let’s prepare prompts for generating data. A good prompt is the foundation for successful data generation. Take time to think through the kind of data you need and craft your prompt accordingly.

[3]:

problems = [
    "How many letters 'r' in the word 'strawberry'?",
    "9.11 and 9.9, which one is bigger?",
]

Simple SFT Data Generation#

Now we can start to get our hands dirty (Finally!). First, we define a task to generate data using a vanilla agent, which is basically the simplest and most basic, classical form of agents.

We first define two tasks, one of which is to generate data using a vanilla agent (basically just run a query), and the other is to convert the data to the format we want.

What are tasks in Synthora?

Tasks are basic units of workload in workflow. They are the smallest units of work that can be executed independently. In Synthora, tasks are defined using the @task decorator. For more details, please refer to our official documentation of workflow.

[4]:

@task
def generate_simple_data(prompt: str) -> List[BaseMessage]:
    agent = VanillaAgent.default()
    _ = agent.run(prompt)
    return agent.history


@task
def format_data(*resps: List[BaseMessage]) -> List[Dict[str, str]]:
    return [
        {
            "prompt": str(resp[0].content),
            "instruct": str(resp[1].content),
            "response": str(resp[2].content),
        }
        for resp in resps
    ]

Then we can define a workflow to first generate data on each question and format the responses into a readable format like below:

[5]:

flow = ThreadPoolScheduler.map(generate_simple_data, problems) >> format_data
flow.run()

[5]:

[{'prompt': '\nYou are an AI assistant.\n',
  'instruct': "How many letters 'r' in the word 'strawberry'?",
  'response': 'The word "strawberry" contains two letters "r."'},
 {'prompt': '\nYou are an AI assistant.\n',
  'instruct': '9.11 and 9.9, which one is bigger?',
  'response': "The number 9.11 is larger than 9.9. Here's a breakdown:\n\n- 9.11 can be thought of as 9 + 0.11.\n- 9.9 can be thought of as 9 + 0.90.\n\nWhen comparing the decimal parts, 0.11 is less than 0.90; thus, 9.11 is smaller than 9.9. \n\nHence, 9.9 is bigger than 9.11."}]

There we get some simple data that can be used for simple SFT, which is basically just some query and answers without any intermediate steps. The quality of the data, however, could be low and containing mistakes.

Now, what if we want to generate more complex data with better quality? It’s time for Chain of Thoughts (COT) or Tree of Thoughts (ToT) data to kick in.

CoT Data Generation#

Well, it’s not that hard to generate CoT data actually. The only thing we need to do is to change the prompt to ZeroShotCoTPrompt, which is the simplest prompt guiding the agent to output the thinking steps.

We are using ZeroShotCoTPrompt, which is a basical CoT prompts with no examples, for a quick glance here. If you want to generate data with higher quality (or more at your preference), you can augment it with some examples of your own.

[6]:

@task
def generate_cot_data(prompt: str) -> List[BaseMessage]:
    agent = VanillaAgent.default(ZeroShotCoTPrompt)
    _ = agent.run(prompt)
    return agent.history


flow = ThreadPoolScheduler.map(generate_cot_data, problems) >> format_data
flow.run()

[6]:

[{'prompt': '\nSolve the following problem step by step. For each step,\ncarefully explain your reasoning, include all calculations, and state any assumptions you make.\nEnsure that each step logically leads to the next, and provide a clear and concise final answer at the end.\nIf relevant, break the problem into smaller parts and address each part individually before combining the results.\n',
  'instruct': "How many letters 'r' in the word 'strawberry'?",
  'response': "To solve this problem, we need to count the occurrences of the letter 'r' in the word 'strawberry.' \n\nLet's proceed step by step:\n\n1. **Identify the word:** The word we are analyzing is 'strawberry.'\n\n2. **List the letters:** Write out the letters in 'strawberry' to ensure none are missed:\n   - s\n   - t\n   - r\n   - a\n   - w\n   - b\n   - e\n   - r\n   - r\n   - y\n\n3. **Highlight occurrences of 'r':** Now, let's specifically mark each 'r' in the list:\n   - s\n   - t\n   - **r**\n   - a\n   - w\n   - b\n   - e\n   - **r**\n   - **r**\n   - y\n\n4. **Count the 'r's:** There are three letters 'r' in the highlighted list.\n\nTherefore, the word 'strawberry' contains 3 letters 'r'.\n\nThe final answer is: There are 3 'r's in the word 'strawberry.'"},
 {'prompt': '\nSolve the following problem step by step. For each step,\ncarefully explain your reasoning, include all calculations, and state any assumptions you make.\nEnsure that each step logically leads to the next, and provide a clear and concise final answer at the end.\nIf relevant, break the problem into smaller parts and address each part individually before combining the results.\n',
  'instruct': '9.11 and 9.9, which one is bigger?',
  'response': 'To determine which number is bigger, 9.11 or 9.9, we will compare them step by step. A precise approach involves analyzing each digit as they appear number by number.\n\n1. **Identifying the whole number part**:\n   - For 9.11, the whole number part is 9.\n   - For 9.9, the whole number part is also 9.\n\n   Both numbers have the same whole number part (9), so we need to move to the decimal parts to find out which number is larger.\n\n2. **Comparing the decimal part**:\n   - In 9.11, the decimal part is "0.11".\n   - In 9.9, the decimal part is "0.9".\n\n   To compare these, align them as two-digit numbers by rewriting them as follows, making them easier to compare:\n   \n   - Convert 9.11 as "9.11" which means 9 whole and 11 hundredths.\n   - Convert 9.9 to "9.90" which means 9 whole and 90 hundredths.\n\n   Now, when comparing the decimal parts directly:\n   - 0.11 (or 11/100) is less than 0.90 (or 90/100) because 11 is less than 90.\n\n3. **Conclusion**:\n   Since the decimal part of 9.9 is greater than the decimal part of 9.11, 9.9 is the larger number.\n\n**Final Answer**: 9.9 is bigger than 9.11.'}]

Nice, we just got some data containing steps of thinking, which appearently has better quality compared with our first version.

This is not the end, however. Sometimes a single CoT process won’t solve the question we gave. To address this, we can use Tree of Thoughts (ToT).

ToT Data Generation#

ToT data generation will apply the following procedure:

For each step, the agent will generate multiple answers.
Another agent will search through the tree by BFS or DFS to check if there exists a path where the problem has been solved successfully.

ToT will usually improve the success rate of problem solving, and also improve the quality of the data generated. Unfortunately, since ToT applies a new approach of data generation, it won’t be as that easy as CoT, which can be simply done by altering the prompt. But no worries! We got your back.

We offer a ToTAgent in Synthora, which encapsulates all the dirty works for users. In ToTAgent, the tree will be searched with DFS, and we only need to make some configurations like level_size or max_turns here.

[7]:

@task
def generate_tot_data(prompt: str) -> List[BaseMessage]:
    agent = ToTAgent.default(level_size=2, max_turns=15)
    resp = agent.run(prompt)
    if resp.is_err:
        # the problem is not solved successfully
        return []
    return agent.history

Then we can create a even harder question for the agent to solve.

[9]:

hard_question = (
    "Consider a regular octagon. How many different triangles can be formed "
    "if the octagon is placed inside a circle and we can also use the center "
    "of the circle as a vertex for the triangles? Let's think step by step."
)
flow = ThreadPoolScheduler.map(generate_tot_data, problems + [hard_question])
results = flow.run()

# Get the data for the last question
for res in results[-1][1:]:
    print(res.content)

Consider a regular octagon. How many different triangles can be formed if the octagon is placed inside a circle and we can also use the center of the circle as a vertex for the triangles? Let's think step by step.
To solve this problem, we need to count the number of distinct triangles that can be formed using the vertices of a regular octagon and the center of the circle that circumscribes the octagon. We'll consider each possible case step-by-step, ensuring that we count all possible triangles with clarity.

**Step 1: Understand the structure of the octagon within the circle.**

Think: A regular octagon has 8 vertices, and these vertices lie on the circumference of a circle. Additionally, we have the center of the circle, which can be used as an extra vertex. Therefore, we have a total of 9 points (8 on the circle + 1 center) we can use to form triangles.

**Action: List the points.**

Output: The points we have are \( A_1, A_2, A_3, \ldots, A_8 \) (the vertices of the octagon) and \( C \) (the center of the circle), giving us points: \( A_1, A_2, A_3, \ldots, A_8, C \).

Next, let's determine the number of triangles considering these points.
**Step 2: Consider triangles formed entirely by the octagon's vertices.**

Think: First, we should count the triangles formed using only the vertices of the octagon, ignoring the center of the circle for now. Since a triangle needs three vertices and we've got eight vertices from the octagon, we can choose any 3 vertices from these 8 to form a triangle.

The number of ways to choose 3 vertices from 8 is calculated by the combination formula \(\binom{n}{r}\), where \(n\) is the total number of items, and \(r\) is the number of items to choose.

**Action: Calculate the number of combinations of 8 vertices taken 3 at a time.**

Output: The number of triangles formed by the octagon's vertices alone is \(\binom{8}{3} = 56\).

Now, let's move on to considering the inclusion of the center of the circle as a vertex in forming triangles.
**Step 3: Consider triangles that include the center of the circle as one vertex.**

Think: Since we want to form triangles that include the center of the circle (\(C\)) as one of the vertices, we need to combine \(C\) with any two vertices of the octagon to form a triangle. Each such triangle will have \(C\) and two vertices from the set of octagon's vertices.

With \(8\) vertices available, we can choose \(2\) at a time to pair with \(C\).

**Action: Calculate the number of triangles that can be formed by choosing 2 out of the 8 vertices along with the center \(C\).**

Output: The number of triangles formed with the center and two vertices of the octagon is \(\binom{8}{2} = 28\).

Now we have computed triangles that include the center. We must verify that all these triangles are unique compared to those formed by the octagon's vertices alone.
**Step 4: Combine the results and verify uniqueness.**

Think: We have two sets of triangles:
- Triangles that use only the octagon's vertices: 56 triangles.
- Triangles that utilize the center and two octagon vertices: 28 triangles.

Since each triangle in the second set inherently includes the center \(C\) and is not formed by the vertices on the edge of the octagon alone, there is no overlap between these two sets of triangles. Thus, there’s no need to examine further if these triangles coincide with those consisting solely of octagon vertices.

**Action: Calculate the total number of distinct triangles.**

Output: The total number of distinct triangles is \(56 + 28 = 84\).

Thus, the total number of different triangles that can be formed, considering both the octagon’s vertices and the center of the circle, is 84. Let's proceed to verify this solution.

We can see that the hard problem has been solved successfully. We can also check the data generated for the question comparing 9.11 and 9.9, just for comparison with previous approaches like CoT.

[10]:

for res in results[1][1:]:
    print(res.content)

9.11 and 9.9, which one is bigger?
To solve the problem of determining which number is bigger between 9.11 and 9.9, we need to compare these two decimal numbers.

**Think**: The first step in comparing two decimal numbers is to start from the leftmost digit and compare them one by one. If the digits in the same decimal place are equal, we move to the next digit to the right. If one is greater than the other, then that number is larger.

Let's start by comparing the numbers 9.11 and 9.9.

**Output**:

- In the integer part (before the decimal point), both numbers have the digit '9'.
- Next, compare the tenths place, which is the first digit after the decimal point: both numbers have the digit '9'. So, they are equal up to this point.
- Now, look at the hundredths place: 9.11 has '1' and 9.9 can be seen as '9.90', so it has '0'.

Since 1 (from 9.11) is greater than 0 (from 9.9 or 9.90), the number 9.11 is larger than 9.9.

Let’s proceed with comparing the hundredths place and draw the conclusion.

Another Example: Scoring Generated Data#

Now, I believe you already have a brief sense on how to use Synthora to generate simple, CoT and ToT data. At the end of this tutorial, we gonna walk through another case, where we will generate multiple entries of data on the same problem and let one agent to score each of them.

First we can define a function used to score two entries of generated data:

[11]:

def score_response(
    history1: List[BaseMessage], history2: List[BaseMessage], prompt: str
) -> Dict[str, Any]:
    response_format = get_pydantic_model('{"score1": 0.0, "score2": 0.0}')

    system_prompt = textwrap.dedent(
        f"""\
        You are a judge to score responses to the following question, scaling from 0 to 10.

        {prompt}
        """  # noqa: E501
    )
    agent = VanillaAgent.default(system_prompt)
    agent.model.config["response_format"] = response_format

    # Skip the system message in history
    openai_history1 = [msg.to_openai_message() for msg in history1[1:]]
    openai_history2 = [msg.to_openai_message() for msg in history2[1:]]

    _history1 = "\n".join(
        [f"{msg['role']}: {msg['content']}" for msg in openai_history1]
    )
    _history2 = "\n".join(
        [f"{msg['role']}: {msg['content']}" for msg in openai_history2]
    )

    agent.history.append(user(f"Response 1:\n{_history1}"))
    agent.history.append(user(f"Response 2:\n{_history2}"))
    resp = agent.run("Please score the two responses.").unwrap().parsed
    result = {
        "chosen": openai_history1
        if resp.score1 > resp.score2
        else openai_history2,
        "rejected": openai_history2
        if resp.score1 > resp.score2
        else openai_history1,
        "score_chosen": resp.score1
        if resp.score1 > resp.score2
        else resp.score2,
        "score_rejected": resp.score2
        if resp.score1 > resp.score2
        else resp.score1,
    }
    return result

Then we can define a function of generating two entries of data with the workflow like below, where two tasks (agents) will run concurrently trying to solve the same problem.

[12]:

def generate_data(system1: str, system2: str, prompt: str) -> Dict[str, Any]:
    agent1, agent2 = (
        VanillaAgent.default(system1),
        VanillaAgent.default(system2),
    )
    flow = (BaseTask(agent1.run) | BaseTask(agent2.run)).s(prompt)
    _ = flow.run()
    return score_response(agent1.history, agent2.history, prompt)

For each problem, we can have a system message for it. For the consideration of convenience, we will use the same simple message for each problem.

[13]:

system_message = "You are an AI Assistant."
system1 = [system_message for _ in problems]
system2 = [system_message for _ in problems]

Then we can run all the problems concurrently with ProcessPoolScheduler, which can run tasks with multi-process approach.

Here we have two nested parallelism, which is supported by nested workflow:

The outmost workflow works with multi-process, where each problem takes up a process to be solved.

Inside each process, there will also be a workflow, where two agents are trying to solve the same problem concurrently with multi-thread.

[14]:

flow = ProcessPoolScheduler.starmap(
    BaseTask(generate_data), zip(system1, system2, problems)
)
results = flow.run()

Then we can print the result to see the accepted and rejected data to each problem, according to the score:

[15]:

for result in results:
    print(result)

{'chosen': [{'content': "How many letters 'r' in the word 'strawberry'?", 'role': 'user', 'name': 'user'}, {'content': 'The word "strawberry" contains three instances of the letter \'r\'.', 'role': 'assistant', 'name': 'gpt-4o'}], 'rejected': [{'content': "How many letters 'r' in the word 'strawberry'?", 'role': 'user', 'name': 'user'}, {'content': 'The word "strawberry" contains two letters \'r\'.', 'role': 'assistant', 'name': 'gpt-4o'}], 'score_chosen': 10.0, 'score_rejected': 1.0}
{'chosen': [{'content': '9.11 and 9.9, which one is bigger?', 'role': 'user', 'name': 'user'}, {'content': '9.11 is greater than 9.9. When comparing decimal numbers, you start from the left and compare each digit. Both numbers have 9 as the whole number, so you move to the tenths place. Here, both have 9, but in the hundredths place, 11 comes after the 9, making 9.11 larger than 9.9.', 'role': 'assistant', 'name': 'gpt-4o'}], 'rejected': [{'content': '9.11 and 9.9, which one is bigger?', 'role': 'user', 'name': 'user'}, {'content': 'The number 9.11 is bigger than 9.9. When comparing decimal numbers, start by comparing the digits from left to right. Both numbers have 9 as the whole number part, but 9.11 has 11 in the decimal part, whereas 9.9 has 9 in the decimal part (which is equivalent to 9.90 if you extend the decimal). Therefore, 9.11 is greater than 9.9.', 'role': 'assistant', 'name': 'gpt-4o'}], 'score_chosen': 10.0, 'score_rejected': 0.0}

Highlights#

In this tutorial, we walked through the generation of simple SFT data, CoT data, and ToT data, with the support of workflow in Synthora. At last, we introduced an example, where we can generate multiple entries of data concurrently, taking leverage of the support of nested workflow.

About Synthora#

Synthora is a lightweight and extensible framework for LLM-driven Agents and ALM research. It provides essential components to build, test and evaluate agents. At its core, Synthora aims to assemble an agent with a single config, thus minimizing your effort in building, tuning, and sharing agents.

If you find this tutorial interesting, feel free to visit our GitHub Repo and leave a star🌟! Any feedback from you will mean a lot to us.

Data Generation Through Workflow in Synthora

Contents