Enabling & Exploring Stable Defussion – Part 3

Before we dive into the details of this post, let us provide the previous two links that precede it.

Enabling & Exploring Stable Defussion – Part 1

Enabling & Exploring Stable Defussion – Part 2

For, reference, we’ll share the demo before deep dive into the actual follow-up analysis in the below section –


Now, let us continue our discussions from where we left.

class clsText2Image:
    def __init__(self, pipe, output_path, filename):

        self.pipe = pipe
        
        # More aggressive attention slicing
        self.pipe.enable_attention_slicing(slice_size=1)

        self.output_path = f"{output_path}{filename}"
        
        # Warm up the pipeline
        self._warmup()
    
    def _warmup(self):
        """Warm up the pipeline to optimize memory allocation"""
        with torch.no_grad():
            _ = self.pipe("warmup", num_inference_steps=1, height=512, width=512)
        torch.mps.empty_cache()
        gc.collect()
    
    def generate(self, prompt, num_inference_steps=12, guidance_scale=3.0):
        try:
            torch.mps.empty_cache()
            gc.collect()
            
            with torch.autocast(device_type="mps"):
                with torch.no_grad():
                    image = self.pipe(
                        prompt,
                        num_inference_steps=num_inference_steps,
                        guidance_scale=guidance_scale,
                        height=1024,
                        width=1024,
                    ).images[0]
            
            image.save(self.output_path)
            return 0
        except Exception as e:
            print(f'Error: {str(e)}')
            return 1
        finally:
            torch.mps.empty_cache()
            gc.collect()

    def genImage(self, prompt):
        try:

            # Initialize generator
            x = self.generate(prompt)

            if x == 0:
                print('Successfully processed first pass!')
            else:
                print('Failed complete first pass!')
                raise 

            return 0

        except Exception as e:
            print(f"\nAn unexpected error occurred: {str(e)}")

            return 1

This is the initialization method for the clsText2Image class:

  • Takes a pre-configured pipe (text-to-image pipeline), an output_path, and a filename.
  • Enables more aggressive memory optimization by setting “attention slicing.”
  • Prepares the full file path for saving generated images.
  • Calls a _warmup method to pre-load the pipeline and optimize memory allocation.

This private method warms up the pipeline:

  • Sends a dummy “warmup” request with basic parameters to allocate memory efficiently.
  • Clears any cached memory (torch.mps.empty_cache()) and performs garbage collection (gc.collect()).
  • Ensures smoother operation for future image generation tasks.

This method generates an image from a text prompt:

  • Clears memory cache and performs garbage collection before starting.
  • Uses the text-to-image pipeline (pipe) to generate an image:
    • Takes the prompt, number of inference steps, and guidance scale as input.
    • Outputs an image at 1024×1024 resolution.
  • Saves the generated image to the specified output path.
  • Returns 0 on success or 1 on failure.
  • Ensures cleanup by clearing memory and collecting garbage, even in case of errors.

This method simplifies image generation:

  • Calls the generate method with the given prompt.
  • Prints a success message if the image is generated (0 return value).
  • On failure, logs the error and raises an exception.
  • Returns 0 on success or 1 on failure.
class clsImage2Video:
    def __init__(self, pipeline):
        
        # Optimize model loading
        torch.mps.empty_cache()
        self.pipeline = pipeline

    def generate_frames(self, pipeline, init_image, prompt, duration_seconds=10):
        try:
            torch.mps.empty_cache()
            gc.collect()

            base_frames = []
            img = Image.open(init_image).convert("RGB").resize((1024, 1024))
            
            for _ in range(10):
                result = pipeline(
                    prompt=prompt,
                    image=img,
                    strength=0.45,
                    guidance_scale=7.5,
                    num_inference_steps=25
                ).images[0]

                base_frames.append(np.array(result))
                img = result
                torch.mps.empty_cache()

            frames = []
            for i in range(len(base_frames)-1):
                frame1, frame2 = base_frames[i], base_frames[i+1]
                for t in np.linspace(0, 1, int(duration_seconds*24/10)):
                    frame = (1-t)*frame1 + t*frame2
                    frames.append(frame.astype(np.uint8))
            
            return frames
        except Exception as e:
            frames = []
            print(f'Error: {str(e)}')

            return frames
        finally:
            torch.mps.empty_cache()
            gc.collect()

    # Main method
    def genVideo(self, prompt, inputImage, targetVideo, fps):
        try:
            print("Starting animation generation...")
            
            init_image_path = inputImage
            output_path = targetVideo
            fps = fps
            
            frames = self.generate_frames(
                pipeline=self.pipeline,
                init_image=init_image_path,
                prompt=prompt,
                duration_seconds=20
            )
            
            imageio.mimsave(output_path, frames, fps=30)

            print("Animation completed successfully!")

            return 0
        except Exception as e:
            x = str(e)
            print('Error: ', x)

            return 1

This initializes the clsImage2Video class:

  • Clears the GPU cache to optimize memory before loading.
  • Sets up the pipeline for generating frames, which uses an image-to-video transformation model.

This function generates frames for a video:

  • Starts by clearing GPU memory and running garbage collection.
  • Loads the init_image, resizes it to 1024×1024 pixels, and converts it to RGB format.
  • Iteratively applies the pipeline to transform the image:
    • Uses the prompt and specified parameters like strengthguidance_scale, and num_inference_steps.
    • Stores the resulting frames in a list.
  • Interpolates between consecutive frames to create smooth transitions:
    • Uses linear blending for smooth animation across a specified duration and frame rate (24 fps for 10 segments).
  • Returns the final list of generated frames or an empty list if an error occurs.
  • Always clears memory after execution.

This is the main function for creating a video from an image and text prompt:

  • Logs the start of the animation generation process.
  • Calls generate_frames() with the given pipelineinputImage, and prompt to create frames.
  • Saves the generated frames as a video using the imageio library, setting the specified frame rate (fps).
  • Logs a success message and returns 0 if the process is successful.
  • On error, logs the issue and returns 1.

Now, let us understand the performance. But, before that let us explore the device on which we’ve performed these stress test that involves GPU & CPUs as well.

And, here is the performance stats –

From the above snapshot, we can clearly communicate that the GPU is 100% utilized. However, the CPU has shown a significant % of availability.

As you can see, the first pass converts the input prompt to intermediate images within 1 min 30 sec. However, the second pass constitutes multiple hops (11 hops) on an avg 22 seconds. Overall, the application will finish in 5 minutes 36 seconds for a 10-second video clip.


So, we’ve done it.

You can find the detailed code at the GitHub link.

I’ll bring some more exciting topics in the coming days from the Python verse.

Till then, Happy Avenging! 🙂