Enabling & Exploring Stable Diffusion – Part 3

Before we dive into the details of this post, here are the links to the two previous posts in this series.

Enabling & Exploring Stable Diffusion – Part 1

Enabling & Exploring Stable Diffusion – Part 2

For reference, we’ll share the demo before taking a deep dive into the follow-up analysis in the section below –


Now, let us continue our discussion from where we left off.

import gc

import torch


class clsText2Image:
    def __init__(self, pipe, output_path, filename):

        self.pipe = pipe
        
        # More aggressive attention slicing
        self.pipe.enable_attention_slicing(slice_size=1)

        self.output_path = f"{output_path}{filename}"
        
        # Warm up the pipeline
        self._warmup()
    
    def _warmup(self):
        """Warm up the pipeline to optimize memory allocation"""
        with torch.no_grad():
            _ = self.pipe("warmup", num_inference_steps=1, height=512, width=512)
        torch.mps.empty_cache()
        gc.collect()
    
    def generate(self, prompt, num_inference_steps=12, guidance_scale=3.0):
        try:
            torch.mps.empty_cache()
            gc.collect()
            
            with torch.autocast(device_type="mps"):
                with torch.no_grad():
                    image = self.pipe(
                        prompt,
                        num_inference_steps=num_inference_steps,
                        guidance_scale=guidance_scale,
                        height=1024,
                        width=1024,
                    ).images[0]
            
            image.save(self.output_path)
            return 0
        except Exception as e:
            print(f'Error: {str(e)}')
            return 1
        finally:
            torch.mps.empty_cache()
            gc.collect()

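    def genImage(self, prompt):
        try:
            # Run the text-to-image pass; generate() returns 0 on success
            x = self.generate(prompt)

            if x == 0:
                print('Successfully processed first pass!')
            else:
                print('Failed to complete first pass!')
                # A bare `raise` has no active exception here, so raise explicitly
                raise RuntimeError('Text-to-image generation failed.')

            return 0

        except Exception as e:
            print(f"\nAn unexpected error occurred: {str(e)}")

            return 1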

This is the initialization method for the clsText2Image class:

  • Takes a pre-configured pipe (text-to-image pipeline), an output_path, and a filename.
  • Enables more aggressive memory optimization by turning on attention slicing with a slice size of 1.
  • Prepares the full file path for saving generated images.
  • Calls a _warmup method to pre-load the pipeline and optimize memory allocation.

The private _warmup method warms up the pipeline:

  • Sends a dummy “warmup” request with basic parameters to allocate memory efficiently.
  • Clears any cached memory (torch.mps.empty_cache()) and performs garbage collection (gc.collect()).
  • Ensures smoother operation for future image generation tasks.

The generate method generates an image from a text prompt:

  • Clears memory cache and performs garbage collection before starting.
  • Uses the text-to-image pipeline (pipe) to generate an image:
    • Takes the prompt, number of inference steps, and guidance scale as input.
    • Outputs an image at 1024×1024 resolution.
  • Saves the generated image to the specified output path.
  • Returns 0 on success or 1 on failure.
  • Ensures cleanup by clearing memory and collecting garbage, even in case of errors.
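
The clean-up-before-and-after pattern described above can be factored into a small context manager. This is only a minimal sketch, not the author's code: memory_cleanup and its empty_cache parameter are hypothetical names, and the callback stands in for a backend call such as torch.mps.empty_cache so that the example runs without PyTorch.

```python
import gc
from contextlib import contextmanager


@contextmanager
def memory_cleanup(empty_cache=None):
    """Free memory before and after the wrapped block (hypothetical helper)."""

    def _cleanup():
        # empty_cache stands in for a backend call such as torch.mps.empty_cache
        if empty_cache is not None:
            empty_cache()
        gc.collect()

    _cleanup()      # free memory before the heavy work starts
    try:
        yield
    finally:
        _cleanup()  # always runs, even if the wrapped block raises


calls = []
with memory_cleanup(empty_cache=lambda: calls.append("cache cleared")):
    calls.append("generate image")

print(calls)
```

The finally clause is what guarantees the second clean-up, which mirrors the finally block in the generate method.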

The genImage method simplifies image generation:

  • Calls the generate method with the given prompt.
  • Prints a success message if the image is generated (0 return value).
  • On failure, prints an error message and raises an exception, which is caught inside the method itself.
  • Returns 0 on success or 1 on failure.

import gc

import imageio
import numpy as np
import torch
from PIL import Image


class clsImage2Video:
    def __init__(self, pipeline):
        
        # Optimize model loading
        torch.mps.empty_cache()
        self.pipeline = pipeline

    def generate_frames(self, pipeline, init_image, prompt, duration_seconds=10):
        try:
            torch.mps.empty_cache()
            gc.collect()

            base_frames = []
            img = Image.open(init_image).convert("RGB").resize((1024, 1024))
            
            for _ in range(10):
                result = pipeline(
                    prompt=prompt,
                    image=img,
                    strength=0.45,
                    guidance_scale=7.5,
                    num_inference_steps=25
                ).images[0]

                base_frames.append(np.array(result))
                img = result
                torch.mps.empty_cache()

            frames = []
            for i in range(len(base_frames)-1):
                frame1, frame2 = base_frames[i], base_frames[i+1]
                for t in np.linspace(0, 1, int(duration_seconds*24/10)):
                    frame = (1-t)*frame1 + t*frame2
                    frames.append(frame.astype(np.uint8))
            
            return frames
        except Exception as e:
            frames = []
            print(f'Error: {str(e)}')

            return frames
        finally:
            torch.mps.empty_cache()
            gc.collect()

    # Main method
    def genVideo(self, prompt, inputImage, targetVideo, fps):
        try:
            print("Starting animation generation...")
            
            init_image_path = inputImage
            output_path = targetVideo
            
            frames = self.generate_frames(
                pipeline=self.pipeline,
                init_image=init_image_path,
                prompt=prompt,
                duration_seconds=20
            )
            
            # Use the caller-supplied frame rate instead of a hard-coded value
            imageio.mimsave(output_path, frames, fps=fps)

            print("Animation completed successfully!")

            return 0
        except Exception as e:
            x = str(e)
            print('Error: ', x)

            return 1

This initializes the clsImage2Video class:

  • Clears the GPU cache to optimize memory before loading.
  • Stores the pipeline used for generating frames. The video is built by applying an image-to-image model repeatedly.

The generate_frames method generates frames for a video:

  • Starts by clearing GPU memory and running garbage collection.
  • Loads the init_image, converts it to RGB format, and resizes it to 1024×1024 pixels.
  • Iteratively applies the pipeline to transform the image (10 iterations):
    • Uses the prompt and specified parameters like strength, guidance_scale, and num_inference_steps.
    • Stores the resulting frames in a list and feeds each result back in as the next input image.
  • Interpolates between consecutive frames to create smooth transitions:
    • Uses linear blending, producing int(duration_seconds × 24 / 10) blended frames for each pair of consecutive base frames.
  • Returns the final list of generated frames or an empty list if an error occurs.
  • Always clears memory after execution.
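
To make the interpolation arithmetic concrete, here is a small pure-Python sketch of the blend weights and the resulting frame count. The helper names (blend_weights, blend_pixel, frame_budget) are hypothetical and are not part of the original classes; only the per-pair expression is taken from the listing above.

```python
def blend_weights(n):
    """Evenly spaced weights from 0 to 1 inclusive, like np.linspace(0, 1, n)."""
    if n == 1:
        return [0.0]
    return [i / (n - 1) for i in range(n)]


def blend_pixel(a, b, t):
    """Linear blend of two pixel values, truncated to an integer."""
    return int((1 - t) * a + t * b)


def frame_budget(num_base_frames, duration_seconds, fps):
    """Return (frames per pair, total frames, clip length in seconds)."""
    per_pair = int(duration_seconds * 24 / 10)  # same expression as generate_frames
    total = (num_base_frames - 1) * per_pair    # one segment per consecutive pair
    return per_pair, total, total / fps


print(blend_weights(5))
print(blend_pixel(0, 200, 0.5))
print(frame_budget(10, 20, 30))
```

With 10 base frames and duration_seconds=20, this works out to 48 blended frames per pair and 432 frames in total, which is 14.4 seconds of video at 30 fps. Because the weights include both 0 and 1, the last frame of each segment is the same image as the first frame of the next one.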

The genVideo method is the main function for creating a video from an image and text prompt:

  • Logs the start of the animation generation process.
  • Calls generate_frames() with the given pipeline, inputImage, and prompt to create frames.
  • Saves the generated frames as a video using the imageio library, setting the specified frame rate (fps).
  • Logs a success message and returns 0 if the process is successful.
  • On error, logs the issue and returns 1.

Now, let us understand the performance. But before that, let us look at the device on which we performed this stress test, which involves both the GPU & the CPUs.

And here are the performance stats –

From the above snapshot, we can clearly see that the GPU is 100% utilized. However, the CPU still shows a significant percentage of available capacity.

As you can see, the first pass converts the input prompt to intermediate images within 1 min 30 sec. However, the second pass consists of multiple hops (11 hops), each taking 22 seconds on average. Overall, the application finishes in 5 minutes 36 seconds for a 10-second video clip.
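
As a quick sanity check on these figures, the short calculation below adds up the two passes using only the numbers quoted above. It is a rough estimate; the variable names are illustrative.

```python
first_pass_seconds = 90   # 1 min 30 sec for the text-to-image pass
hops = 11                 # second-pass hops reported above
avg_hop_seconds = 22      # average time per hop

total_seconds = first_pass_seconds + hops * avg_hop_seconds
minutes, seconds = divmod(total_seconds, 60)
print(f"{minutes} min {seconds} sec")
```

The estimate comes to about 5 min 32 sec, close to the measured 5 min 36 sec. The post does not break down the remaining few seconds, so they are presumably spent on steps outside the two passes.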


So, we’ve done it.

You can find the detailed code at the GitHub link.

I’ll bring some more exciting topics in the coming days from the Python verse.

Till then, Happy Avenging! 🙂

Enabling & Exploring Stable Diffusion – Part 2

As we started explaining the importance & usage of Stable Diffusion in our previous post:

Enabling & Exploring Stable Diffusion – Part 1

In today’s post, we’ll discuss another approach, where we built a custom Python-based SDK solution that uses the HuggingFace libraries to generate a video from the supplied prompt.

But, before that, let us view the demo generated by the custom solution.

Isn’t it exciting? Let us dive deep into the details.


Let us understand the basic flow of events for the custom solution –

So, the application will interact with the Python SDK to use models like “stable-diffusion-3.5-large” & “dreamshaper-xl-1-0”, which are available on HuggingFace. As part of the process, these libraries will download all the large models to the local laptop, which takes some time depending on the bandwidth of your internet connection.

Before we deep dive into the code, let us understand the flow of Python scripts as shown below:

From the above diagram, we can understand that the main application will be triggered by “generateText2Video.py”. As you can see, “clsConfigClient.py” holds all the necessary parameter information that will be supplied to all the scripts.

“generateText2Video.py” will trigger the main class defined in “clsText2Video.py”, which then calls all the subsequent classes.

Great! Since we now have better visibility of the script flow, let’s examine the key snippets individually.


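import os

import torch

# cm, cti, and civ refer to the modules that define clsMaster,
# clsText2Image, and clsImage2Video respectively


class clsText2Video: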
    def __init__(self, model_id_1, model_id_2, output_path, filename, vidfilename, fps, force_cpu=False):
        self.model_id_1 = model_id_1
        self.model_id_2 = model_id_2
        self.output_path = output_path
        self.filename = filename
        self.vidfilename = vidfilename
        self.force_cpu = force_cpu
        self.fps = fps

        # Initialize in main process
        os.environ["TOKENIZERS_PARALLELISM"] = "true"
        self.r1 = cm.clsMaster(force_cpu)
        self.torch_type = self.r1.getTorchType()
        
        torch.mps.empty_cache()
        self.pipe = self.r1.getText2ImagePipe(self.model_id_1, self.torch_type)
        self.pipeline = self.r1.getImage2VideoPipe(self.model_id_2, self.torch_type)

        self.text2img = cti.clsText2Image(self.pipe, self.output_path, self.filename)
        self.img2vid = civ.clsImage2Video(self.pipeline)

    def getPrompt2Video(self, prompt):
        try:
            input_image = self.output_path + self.filename
            target_video = self.output_path + self.vidfilename

            if self.text2img.genImage(prompt) == 0:
                print('Pass 1: Text to intermediate images generated!')
                
                if self.img2vid.genVideo(prompt, input_image, target_video, self.fps) == 0:
                    print('Pass 2: Successfully generated!')
                    return 0
            return 1
        except Exception as e:
            print(f"\nAn unexpected error occurred: {str(e)}")
            return 1

Now, let us interpret:

This is the initialization method for the class. It does the following:

  • Sets up configurations like model IDs, output paths, filenames, video filename, frames per second (fps), and whether to use the CPU (force_cpu).
  • Configures an environment variable for tokenizer parallelism.
  • Initializes helper classes (clsMaster) to manage system resources and retrieve appropriate PyTorch settings.
  • Creates two pipelines:
    • pipe: For converting text to images using the first model.
    • pipeline: For converting images to video using the second model.
  • Initializes text2img and img2vid objects:
    • text2img handles text-to-image conversions.
    • img2vid handles image-to-video conversions.

The getPrompt2Video method generates a video from a text prompt in two steps:

  1. Text-to-Image Conversion:
    • Calls genImage(prompt) using the text2img object to create an intermediate image file.
    • If successful, it prints confirmation.
  2. Image-to-Video Conversion:
    • Uses the img2vid object to convert the intermediate image into a video file.
    • Includes the input image path, target video path, and frames per second (fps).
    • If successful, it prints confirmation.
  • If either step fails, the method returns 1.
  • Logs any unexpected errors and returns 1 in such cases.

import torch
from diffusers import StableDiffusion3Pipeline, StableDiffusionXLImg2ImgPipeline

# Set device for Apple Silicon GPU
def setup_gpu(force_cpu=False):
    if not force_cpu and torch.backends.mps.is_available() and torch.backends.mps.is_built():
        print('Running on Apple Silicon MPS GPU!')
        return torch.device("mps")
    return torch.device("cpu")

######################################
####         Global Flag      ########
######################################

class clsMaster:
    def __init__(self, force_cpu=False):
        self.device = setup_gpu(force_cpu)

    def getTorchType(self):
        try:
            # Check if MPS (Apple Silicon GPU) is available
            if not torch.backends.mps.is_available():
                # Warn instead of raising, so the float32 fallback is actually returned
                print("Warning: MPS (Metal Performance Shaders) is not available on this system; using torch.float32.")
                torch_dtype = torch.float32
            else:
                torch_dtype = torch.float16
            
            return torch_dtype
        except Exception as e:
            torch_dtype = torch.float16
            print(f'Error: {str(e)}')

            return torch_dtype

    def getText2ImagePipe(self, model_id, torchType):
        try:
            device = self.device

            torch.mps.empty_cache()
            self.pipe = StableDiffusion3Pipeline.from_pretrained(model_id, torch_dtype=torchType, use_safetensors=True, variant="fp16",).to(device)

            return self.pipe
        except Exception as e:
            x = str(e)
            print('Error: ', x)

            torch.mps.empty_cache()
            self.pipe = StableDiffusion3Pipeline.from_pretrained(model_id, torch_dtype=torchType,).to(device)

            return self.pipe
        
    def getImage2VideoPipe(self, model_id, torchType):
        try:
            device = self.device

            torch.mps.empty_cache()
            self.pipeline = StableDiffusionXLImg2ImgPipeline.from_pretrained(model_id, torch_dtype=torchType, use_safetensors=True, use_fast=True).to(device)

            return self.pipeline
        except Exception as e:
            x = str(e)
            print('Error: ', x)

            torch.mps.empty_cache()
            self.pipeline = StableDiffusionXLImg2ImgPipeline.from_pretrained(model_id, torch_dtype=torchType).to(device)

            return self.pipeline

Let us interpret:

The setup_gpu function determines whether to use the Apple Silicon GPU (MPS) or the CPU:

  • If force_cpu is False and the MPS GPU is available, it sets the device to “mps” (Apple GPU) and prints a message.
  • Otherwise, it defaults to the CPU.

This is the initializer for the clsMaster class:

  • It sets the device to either GPU or CPU using the setup_gpu function (mentioned above) based on the force_cpu flag.

The getTorchType method determines the PyTorch data type to use:

  • Checks if MPS GPU is available:
    • If available, uses torch.float16 for optimized performance.
    • If unavailable, defaults to torch.float32 and raises a warning.
  • Handles errors gracefully by defaulting to torch.float16 and printing the error.
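
One way to keep the device and the data type consistent is to decide both in a single place. The sketch below is an assumption-laden illustration, not the author's implementation: pick_device_and_dtype is a hypothetical helper, and it returns plain strings rather than torch objects so that it runs without PyTorch. Unlike getTorchType, it also selects float32 when the CPU is forced on a machine that has MPS.

```python
def pick_device_and_dtype(force_cpu, mps_available):
    """Return (device name, dtype name) as plain strings (hypothetical helper)."""
    if not force_cpu and mps_available:
        return "mps", "float16"  # half precision on the Apple GPU
    return "cpu", "float32"      # full precision on the CPU


for force_cpu in (False, True):
    for mps_available in (True, False):
        print(force_cpu, mps_available, pick_device_and_dtype(force_cpu, mps_available))
```

In real code, mps_available would come from torch.backends.mps.is_available(), and the strings would map to torch.device and torch.float16 or torch.float32.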

The getText2ImagePipe method initializes a text-to-image pipeline:

  • Loads the Stable Diffusion model with the given model_id and torchType.
  • Configures it for MPS GPU or CPU, based on the device.
  • Clears the GPU cache before loading the model to optimize memory usage.
  • If an error occurs, prints it and retries loading the pipeline without the use_safetensors and variant="fp16" options.

The getImage2VideoPipe method initializes an image-to-video pipeline:

  • Similar to getText2ImagePipe, it loads the Stable Diffusion XL Img2Img pipeline with the specified model_id and torchType.
  • Configures it for MPS GPU or CPU and clears the cache before loading.
  • On error, reloads the pipeline without additional optimization settings and prints the error.
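
Both loader methods follow the same "try the preferred options, then retry with defaults" pattern. Here is a generic sketch of that pattern under stated assumptions: load_with_fallback and fake_loader are hypothetical names, and fake_loader merely stands in for a from_pretrained call so that the example runs without downloading a model.

```python
def load_with_fallback(loader, model_id, preferred, fallback):
    """Try loader with preferred options; on any error, retry with fallback options."""
    try:
        return loader(model_id, **preferred)
    except Exception as e:
        print(f"Error: {e}")
        return loader(model_id, **fallback)


def fake_loader(model_id, **options):
    # Stand-in for a from_pretrained call; rejects the fp16 variant on purpose
    if options.get("variant") == "fp16":
        raise ValueError("fp16 variant not found")
    return {"model_id": model_id, "options": options}


pipe = load_with_fallback(
    fake_loader,
    "example/model",
    preferred={"use_safetensors": True, "variant": "fp16"},
    fallback={},
)
print(pipe)
```

If the fallback call also fails, the exception propagates to the caller, which is the same behavior as the except blocks in the two methods above.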

Let us continue this in the next post:

Enabling & Exploring Stable Diffusion – Part 3

Till then, Happy Avenging! 🙂